research-article

Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents

Published: 28 January 2016

Abstract

In this article, I investigate the performance of the bisect K-means clustering algorithm compared to the standard K-means algorithm in the analysis of Arabic documents. The experiments included five commonly used similarity and distance functions (Pearson correlation coefficient, cosine, Jaccard coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) and three leading stemmers. Measured by purity, bisect K-means clearly outperformed standard K-means in all settings, by varying margins. For bisect K-means, the best purity reached 0.927 with the Pearson correlation coefficient function, while for standard K-means, the best purity reached 0.884 with the Jaccard coefficient function. Removing stop words significantly improved the results of bisect K-means but produced only minor improvements in the results of standard K-means. Stemming provided an additional minor improvement in all settings except the combination of the averaged Kullback-Leibler divergence function and the root-based stemmer, where purity deteriorated by more than 10%. These experiments were conducted using a dataset with nine categories, each of which contains 300 documents.
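The purity figures quoted above can be made concrete with a small sketch. It assumes the standard definition of purity (each cluster is credited with its most frequent category, and the credited documents are divided by the total number of documents); the article's full text may state it in different notation. The function name `purity` and the toy data are hypothetical and are not taken from the article's dataset.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of documents that belong to the majority category of their cluster.

    Assumes the standard purity definition; names and inputs are illustrative.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("purity is undefined for an empty clustering")

    # Count how many documents of each category land in each cluster.
    members = defaultdict(Counter)
    for cid, label in zip(cluster_ids, true_labels):
        members[cid][label] += 1

    # Credit each cluster with the size of its largest category.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(cluster_ids)


# Toy example (invented data): 8 documents, 3 clusters, 3 categories.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
labels = ["sport", "sport", "economy", "economy",
          "economy", "health", "health", "health"]
print(purity(clusters, labels))
```

In the toy example each cluster contributes two majority documents, so six of the eight documents are credited. A purity of 1.0 would mean every cluster contains documents from a single category only.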

Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 3
March 2016
220 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/2876004

Copyright © 2016 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 28 January 2016
• Accepted: 1 August 2015
• Revised: 1 May 2015
• Received: 1 December 2014

Qualifiers

• research-article
• Research
• Refereed
