Abstract
Topic segmentation is a core task in Natural Language Processing, yet it has received little research attention for the Arabic language. This article aims to improve Arabic Topic Segmentation (ATS) by investigating two segmenters: ArabC99 and ArabTextTiling. The study is carried out on two independent levels, which correspond to the basic steps of topic segmentation: the pre-processing level and the segmentation level. On the pre-processing level, we examine the effect of different Arabic stemming algorithms on ATS and find that Light10 is better suited to this step than the other stemmers examined. Based on this finding, we move to the segmentation level and propose two Arabic segmenters, ArabC99-LS-LSA and ArabTextTiling-LS-LSA, which draw on external semantic knowledge obtained through Latent Semantic Analysis (LSA). The evaluation results show that LSA improves ATS. The main outcome of this article is therefore a multilevel improvement of ATS based on Light10 and LSA.
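The two levels described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prefix list is a toy stand-in for Light10, the background corpus and all function names (`light_stem`, `fit_lsa`, `embed`, `segment`) are hypothetical, and the boundary rule (cut where adjacent sentences are least similar in the LSA space) is a simplification of TextTiling-style gap scoring rather than ArabC99-LS-LSA or ArabTextTiling-LS-LSA.

```python
import numpy as np

# Assumption: a tiny illustrative prefix list, NOT the real Light10 affix set.
PREFIXES = ("وال", "بال", "ال")

def light_stem(word):
    """Strip one leading prefix if at least three letters remain (toy rule)."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            return word[len(prefix):]
    return word

def tokenize(text):
    """Pre-processing level: whitespace split, then light stemming."""
    return [light_stem(w) for w in text.split()]

def fit_lsa(background_docs, k):
    """Build a term-document count matrix and keep the top-k SVD dimensions."""
    docs = [tokenize(d) for d in background_docs]
    vocab = {t: i for i, t in enumerate(sorted({t for d in docs for t in d}))}
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term in doc:
            counts[vocab[term], j] += 1.0
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return vocab, u[:, :k] * s[:k]  # one latent vector per stem

def embed(sentence, vocab, term_vecs):
    """Represent a sentence as the sum of the latent vectors of its known stems."""
    vec = np.zeros(term_vecs.shape[1])
    for term in tokenize(sentence):
        if term in vocab:
            vec += term_vecs[vocab[term]]
    return vec

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def segment(sentences, vocab, term_vecs, n_boundaries=1):
    """Segmentation level: return indices i with a boundary after sentence i."""
    vecs = [embed(s, vocab, term_vecs) for s in sentences]
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    lowest = np.argsort(sims)[:n_boundaries]
    return sorted(int(i) for i in lowest)

# Hypothetical background corpus (the "external semantic knowledge"):
# three documents about books, two about weather.
background = [
    "الكتاب المكتبة قارئ",
    "المكتبة الكتاب",
    "الكتاب قارئ",
    "المطر الطقس رياح",
    "الطقس المطر",
]
vocab, term_vecs = fit_lsa(background, k=2)

# Text to segment: two sentences about books, then two about weather.
sentences = ["قارئ الكتاب", "المكتبة", "الطقس رياح", "المطر"]
boundaries = segment(sentences, vocab, term_vecs)
print(boundaries)
```

Because the toy background corpus uses disjoint vocabulary for its two topics, the lowest adjacent similarity falls between the second and third sentences. Note that the second sentence shares no surface word with the first: the link comes from the stemmed forms co-occurring in the background corpus, which is the kind of external semantic knowledge the abstract attributes to LSA.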
Supplemental Material
Supplemental movie, appendix, image, and software files for "The Contribution of Stemming and Semantics in Arabic Topic Segmentation" are available for download.