KDD Conference Proceedings, DOI: 10.1145/1150402.1150423
ARTICLE

Quantifying trends accurately despite classifier error and class imbalance

ABSTRACT

This paper promotes a new task for supervised machine learning research: quantification, the pursuit of learning methods for accurately estimating the class distribution of a test set, with no concern for predictions on individual cases. A variant for cost quantification addresses the need to total up costs according to categories predicted by imperfect classifiers. These tasks cover a large and important family of applications that measure trends over time.

The paper establishes a research methodology, and uses it to evaluate several proposed methods that involve selecting the classification threshold in a way that would spoil the accuracy of individual classifications. In empirical tests, Median Sweep methods show outstanding ability to estimate the class distribution, despite wide disparity in testing and training conditions. The paper addresses shifting class priors and costs, but not concept drift in general.
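The abstract does not spell out how the evaluated methods work, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes the true positive rate (tpr) and false positive rate (fpr) of a binary classifier have been measured at several thresholds on held-out training data, corrects the raw count of predicted positives at each threshold, and takes the median of the corrected estimates. The function names, the `rates` structure, and the `min_gap` cutoff are hypothetical choices for illustration and may differ from the paper's exact procedure.

```python
from statistics import median


def adjusted_count(scores, threshold, tpr, fpr):
    """Estimate the fraction of positives in a test set at one threshold.

    The observed rate of predicted positives is corrected using the
    classifier's tpr and fpr at that threshold (assumed known from
    held-out training data), then clipped to the range [0, 1].
    """
    observed = sum(s >= threshold for s in scores) / len(scores)
    estimate = (observed - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


def median_sweep(scores, rates, min_gap=0.25):
    """Median of adjusted-count estimates over many thresholds.

    `rates` is a hypothetical list of (threshold, tpr, fpr) tuples.
    Thresholds where tpr - fpr is small are skipped (assumed cutoff),
    because the correction divides by that difference.
    """
    estimates = [
        adjusted_count(scores, t, tpr, fpr)
        for t, tpr, fpr in rates
        if tpr - fpr > min_gap
    ]
    if not estimates:
        raise ValueError("no threshold has a usable tpr - fpr gap")
    return median(estimates)
```

Note that the chosen thresholds need not be good for classifying individual cases; only the aggregate estimate matters, which matches the abstract's point that the threshold may be selected in a way that spoils per-case accuracy.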
