
Efficient sampling of training set in large and noisy multimedia data

Published: 01 August 2007

Abstract

As the amount of multimedia data increases day by day, thanks to less expensive storage devices and a growing number of information sources, machine learning algorithms are faced with large and noisy datasets. Fortunately, a good training sample can influence the final results significantly. But a simple random sample (SRS) may not yield satisfactory results, because its blind approach to selecting samples may fail to represent a large and noisy dataset adequately. The difficulty is particularly apparent for huge datasets where, due to memory constraints, only very small sample sizes are used. This is typically the case for multimedia applications, where data sizes are usually very large. In this article we propose a new and efficient method to sample large and noisy multimedia data. The proposed method is based on a simple distance measure that compares the histograms of the sample set and the whole set in order to estimate the representativeness of the sample. It also deals with noise in an elegant manner that SRS and other methods cannot. We experiment on image and audio datasets. Comparison with SRS and other methods shows that the proposed method is vastly superior in terms of sample representativeness, particularly for small sample sizes, while remaining comparable in running time to SRS, the least expensive method.
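The core idea of the abstract, estimating a sample's representativeness by the histogram distance between the sample and the whole dataset, can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the bin count, the L1 distance, and the best-of-several-candidates selection strategy are all assumptions made for the sketch, applied here to one-dimensional numeric data.

```python
import random


def histogram(data, bins, lo, hi):
    """Normalized histogram of `data` over `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in data:
        i = min(int((x - lo) / width), bins - 1)  # clamp the max value into the last bin
        counts[i] += 1
    total = len(data)
    return [c / total for c in counts]


def hist_distance(h1, h2):
    """L1 distance between two normalized histograms (0 = identical shape)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def representative_sample(data, size, bins=10, trials=20, seed=0):
    """Draw several random candidate samples and keep the one whose
    histogram is closest to the full dataset's histogram."""
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    full_hist = histogram(data, bins, lo, hi)
    best, best_dist = None, float("inf")
    for _ in range(trials):
        candidate = rng.sample(data, size)
        d = hist_distance(histogram(candidate, bins, lo, hi), full_hist)
        if d < best_dist:
            best, best_dist = candidate, d
    return best
```

A plain SRS corresponds to `trials=1`; increasing `trials` trades time for representativeness, which mirrors the trade-off the abstract reports.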

