ABSTRACT
In text categorization, feature selection can be essential not only for reducing the index size but also for improving the performance of the classifier. In this article, we propose a feature selection criterion called Entropy based Category Coverage Difference (ECCD). This criterion is based on how the documents containing a term are distributed across the categories, and it also takes into account the entropy of that term. ECCD compares favorably with the usual feature selection methods based on document frequency (DF), information gain (IG), mutual information (MI), χ², odds ratio, and GSS on a large collection of XML documents from the Wikipedia encyclopedia. Moreover, this comparative study confirms the effectiveness of feature selection techniques derived from the χ² statistic.
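
The abstract does not state the ECCD formula, so the following is only an illustrative sketch of the two ingredients it names: how the documents containing a term are spread over the categories, and the entropy of that spread. The function names and the way the two quantities are combined (a coverage difference scaled by a normalized entropy complement) are assumptions made for illustration, not the definition given in the article.

```python
import math


def term_entropy(df_per_category):
    """Shannon entropy (in bits) of how the documents containing a term
    are spread over the categories. Low entropy means the term is
    concentrated in few categories."""
    total = sum(df_per_category)
    if total == 0:
        return 0.0
    probs = [n / total for n in df_per_category if n > 0]
    return -sum(p * math.log2(p) for p in probs)


def coverage_difference(df_per_category, docs_per_category, k):
    """Fraction of category k's documents that contain the term, minus
    the same fraction computed over all other categories."""
    inside = df_per_category[k] / docs_per_category[k]
    other_docs = sum(docs_per_category) - docs_per_category[k]
    other_df = sum(df_per_category) - df_per_category[k]
    outside = other_df / other_docs if other_docs else 0.0
    return inside - outside


def entropy_weighted_score(df_per_category, docs_per_category, k):
    """ASSUMED combination (hypothetical, not taken from the article):
    the coverage difference is scaled by (E_max - E) / E_max, so terms
    spread evenly over all categories are pushed toward zero."""
    e_max = math.log2(len(docs_per_category))
    if e_max == 0:
        return coverage_difference(df_per_category, docs_per_category, k)
    weight = (e_max - term_entropy(df_per_category)) / e_max
    return coverage_difference(df_per_category, docs_per_category, k) * weight


# Three categories of 100 documents each (made-up counts).
docs = [100, 100, 100]
focused = entropy_weighted_score([40, 0, 0], docs, 0)    # term only in category 0
diffuse = entropy_weighted_score([40, 40, 40], docs, 0)  # term evenly spread
```

Under these assumptions, a term that appears only in the target category keeps its full coverage difference, while a term spread evenly over all categories scores close to zero, which matches the intuition that such a term does not discriminate between categories.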


