ABSTRACT
Automatically classifying research articles into one or more fields of science is of primary importance for scientific databases and digital libraries. A sophisticated classification strategy makes searching more effective and helps users locate similar, relevant items. Although most publishing services require authors to categorize their articles themselves, older documents may remain unclassified, and the taxonomy may change over time. In this work we address this problem by introducing a machine learning algorithm that combines several parameters and metadata of a research article. In particular, our model uses the training set to correlate keywords, authors, co-authorship, and publishing journals with a number of labels of the taxonomy. It then applies this information to classify the remaining documents. Experiments conducted on a large dataset of about 1.5 million articles demonstrate that, in this specific application, our model outperforms the AdaBoost.MH and SVM methods.
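The general idea of correlating article metadata with taxonomy labels can be illustrated with a minimal Python sketch. This is not the authors' algorithm, whose scoring details are not given in the abstract; it assumes a simple co-occurrence count per feature and a summed relative-frequency score. The function names (`extract_features`, `train`, `classify`), the dictionary layout of an article, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict


def extract_features(article):
    """Turn article metadata into (feature_type, value) pairs (hypothetical layout)."""
    feats = [("keyword", k.lower()) for k in article.get("keywords", [])]
    authors = sorted(article.get("authors", []))
    feats += [("author", a) for a in authors]
    # Co-authorship: every unordered pair of authors on the same article.
    feats += [("coauthors", (a, b))
              for i, a in enumerate(authors) for b in authors[i + 1:]]
    if article.get("journal"):
        feats.append(("journal", article["journal"]))
    return feats


def train(articles):
    """Count how often each feature co-occurs with each label in the training set."""
    counts = defaultdict(Counter)
    for art in articles:
        for feat in extract_features(art):
            for label in art["labels"]:
                counts[feat][label] += 1
    return counts


def classify(counts, article, top_k=1):
    """Score labels by summing, per known feature, the label's relative frequency.

    This scoring rule is an assumption made for illustration only.
    """
    scores = Counter()
    for feat in extract_features(article):
        label_counts = counts.get(feat)
        if not label_counts:
            continue  # feature never seen during training
        total = sum(label_counts.values())
        for label, n in label_counts.items():
            scores[label] += n / total
    return [label for label, _ in scores.most_common(top_k)]


# Made-up training data, for illustration only.
training = [
    {"keywords": ["ranking", "BM25"], "authors": ["A. Author", "B. Writer"],
     "journal": "IR Journal", "labels": ["Information Retrieval"]},
    {"keywords": ["boosting"], "authors": ["C. Scholar"],
     "journal": "ML Journal", "labels": ["Machine Learning"]},
]
model = train(training)

unlabeled = {"keywords": ["ranking"], "authors": ["A. Author"],
             "journal": "IR Journal"}
print(classify(model, unlabeled))
```

An article whose features were never seen in training receives no scores, so `classify` returns an empty list; a real system would need a fallback for that case, such as a text-based classifier.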
Index Terms
A supervised machine learning classification algorithm for research articles