DOI: 10.1145/2783258.2783372
Research article

Efficient Online Evaluation of Big Data Stream Classifiers

Published: 10 August 2015

Abstract

Evaluating classifiers in data streams is fundamental: it allows poorly performing models to be identified and then improved or replaced by better-performing ones. The task is increasingly important as stream data is generated from more sources, in real time and in large quantities, and is now considered the largest source of big data. Both researchers and practitioners need to be able to evaluate the performance of the methods they employ effectively. However, evaluation in a stream poses major challenges. Instances arriving in a data stream are usually time-dependent, and the underlying concept they represent may evolve over time. Furthermore, the massive quantity of data tends to exacerbate issues such as class imbalance. Current frameworks for evaluating streaming and online algorithms can give predictions in real time, but because they use a prequential setting they build only one model, and so cannot compute the statistical significance of results in real time. In this paper we propose a new evaluation methodology for big data streams. The methodology addresses unbalanced data streams, data where change occurs on different time scales, and the question of how to split the data between training and testing over multiple models.
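
For readers unfamiliar with the prequential (test-then-train) setting mentioned in the abstract, the following is a minimal illustrative sketch, not the evaluation methodology proposed in the paper. Each instance is first used to test the model and only then to train it, so one model produces one running accuracy estimate. The names `MajorityClassLearner` and `prequential_accuracy` are hypothetical and exist only for this example.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy online learner (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training instance has been seen there is nothing to predict.
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test-then-train: score each instance before the learner is updated with it."""
    correct = 0
    total = 0
    for x, y in stream:
        if learner.predict(x) == y:  # test first
            correct += 1
        total += 1
        learner.learn(x, y)          # then train on the same instance
    return correct / total if total else 0.0


# Tiny hand-made stream of (features, label) pairs, for illustration only.
stream = [(0, "a"), (1, "a"), (2, "b"), (3, "a")]
accuracy = prequential_accuracy(stream, MajorityClassLearner())
```

Because only a single model is built, this loop gives a single estimate per stream, which is the limitation the abstract points to when it notes that statistical significance cannot be computed from a prequential run alone.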




    Published In

    KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
    August 2015
    2378 pages
    ISBN:9781450336642
    DOI:10.1145/2783258
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. classification
    2. data streams
    3. evaluation
    4. online learning

    Conference

    KDD '15

    Acceptance Rates

    KDD '15 Paper Acceptance Rate 160 of 819 submissions, 20%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


