ABSTRACT
This paper promotes a new task for supervised machine learning research: quantification, the pursuit of learning methods for accurately estimating the class distribution of a test set, with no concern for predictions on individual cases. A variant, cost quantification, addresses the need to total up costs according to categories predicted by imperfect classifiers. These tasks cover a large and important family of applications that measure trends over time. The paper establishes a research methodology and uses it to evaluate several proposed methods that involve selecting the classification threshold in a way that would spoil the accuracy of individual classifications. In empirical tests, Median Sweep methods show outstanding ability to estimate the class distribution, despite wide disparity in testing and training conditions. The paper addresses shifting class priors and costs, but not concept drift in general.
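The abstract names Median Sweep without spelling out the procedure, so the following is only a minimal sketch under stated assumptions. It assumes a binary classifier that outputs a score, and that its true-positive and false-positive rates at each threshold have been estimated on labeled training data (ideally via cross-validation). Each threshold then gives an "adjusted count" estimate of the positive fraction, and the sketch reports the median over thresholds. The function names, the `min_gap` cutoff of 0.25, and the toy scores are illustrative assumptions, not details taken from the abstract.

```python
from statistics import median

def rates_at(threshold, scores, labels):
    """Return (tpr, fpr) of the rule `score >= threshold` on labeled data."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    tpr = sum(s >= threshold for s in pos) / len(pos)
    fpr = sum(s >= threshold for s in neg) / len(neg)
    return tpr, fpr

def classify_and_count(threshold, test_scores):
    """Naive estimate: fraction of test cases predicted positive."""
    return sum(s >= threshold for s in test_scores) / len(test_scores)

def median_sweep(train_scores, train_labels, test_scores, min_gap=0.25):
    """Median of adjusted-count estimates over all candidate thresholds.

    Thresholds where tpr - fpr < min_gap are skipped (assumed cutoff),
    because a small denominator makes the adjustment unstable.
    """
    estimates = []
    for t in sorted(set(train_scores)):
        tpr, fpr = rates_at(t, train_scores, train_labels)
        if tpr - fpr < min_gap:
            continue
        observed = classify_and_count(t, test_scores)
        # Solve observed = p * tpr + (1 - p) * fpr for the prevalence p.
        p = (observed - fpr) / (tpr - fpr)
        estimates.append(min(1.0, max(0.0, p)))  # clip to a valid fraction
    return median(estimates) if estimates else None

# Toy data (hypothetical): an imperfect classifier trained at 50% positives.
train_scores = [0.3, 0.6, 0.7, 0.9, 0.1, 0.2, 0.4, 0.8]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
# Test set with the same score behaviour per class but only 25% positives.
test_scores = [0.3, 0.6, 0.7, 0.9] + [0.1, 0.2, 0.4, 0.8] * 3

print(classify_and_count(0.5, test_scores))                   # 0.375
print(median_sweep(train_scores, train_labels, test_scores))  # 0.25
```

In this toy case, counting predictions at a threshold of 0.5 overestimates the positive fraction because of classifier error, while the median of the adjusted estimates recovers the 25% used to build the test set. Real data would not match this cleanly, since the training-set rates are themselves only estimates.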
Index Terms
Quantifying trends accurately despite classifier error and class imbalance


