Abstract
In this article, I investigate the performance of the bisect K-means clustering algorithm compared with the standard K-means algorithm in the analysis of Arabic documents. The experiments covered five commonly used similarity and distance functions (Pearson correlation coefficient, cosine, Jaccard coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) and three leading stemmers. Measured by purity, the bisect K-means clearly outperformed the standard K-means in all settings, by varying margins. For the bisect K-means, the best purity was 0.927, obtained with the Pearson correlation coefficient function, while for the standard K-means, the best purity was 0.884, obtained with the Jaccard coefficient function. Removing stop words significantly improved the results of the bisect K-means but produced only minor improvements in the results of the standard K-means. Stemming provided an additional minor improvement in all settings except the combination of the averaged Kullback-Leibler divergence function and the root-based stemmer, where purity deteriorated by more than 10%. The experiments were conducted on a dataset of nine categories, each containing 300 documents.
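The purity measure used to compare the two algorithms can be illustrated with a short sketch. It assumes the common definition of purity (each cluster is credited with its most frequent true category, and the credited counts are summed and divided by the number of documents); the abstract does not spell out the exact formula, so this is an illustration rather than the article's implementation. The function name `purity` and the toy labels are hypothetical.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Fraction of documents that fall in the majority category of their cluster.

    Assumes the usual definition of purity; this is not taken from the article.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Count how often each true category occurs inside each cluster.
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, true_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1

    # Credit each cluster with the size of its largest category.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(true_labels)


# Toy example (hypothetical data): six documents, two clusters.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sport", "sport", "politics", "politics", "politics", "sport"]
score = purity(clusters, labels)  # each cluster has a majority of 2, so 4 / 6
```

A purity of 1.0 means every cluster contains documents from a single category. Purity tends to rise as the number of clusters grows, so it is most meaningful when, as in the experiments described here, the compared algorithms produce the same number of clusters.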
Index Terms
Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents