Abstract
As the amount of multimedia data grows day by day, thanks to cheaper storage devices and an increasing number of information sources, machine learning algorithms must cope with large and noisy datasets. The choice of training sample significantly influences the final results, yet a simple random sample (SRS) may be unsatisfactory: because it selects points blindly, it may not adequately represent a large and noisy dataset. The difficulty is particularly apparent for huge datasets where, due to memory constraints, only very small sample sizes can be used; this is typically the case in multimedia applications, where data size is usually very large. In this article we propose a new and efficient method for sampling large and noisy multimedia data. The method is based on a simple distance measure that compares the histogram of the sample set with the histogram of the whole set in order to estimate the representativeness of the sample. It also handles noise in an elegant manner, which SRS and other methods cannot. Experiments on image and audio datasets show that the proposed method is vastly superior to SRS and other methods in terms of sample representativeness, particularly for small sample sizes, while remaining comparable in running time to SRS, the least expensive method.
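The core idea of the abstract, judging representativeness by comparing the sample's histogram with the whole set's histogram, can be sketched as follows. This is a minimal illustration, not the article's actual algorithm: the function name, the choice of per-attribute equal-width bins, and the L1 distance between normalized bin frequencies are all assumptions made for the sketch.

```python
import numpy as np

def histogram_distance(data, sample, bins=10):
    """Rough representativeness score for `sample` drawn from `data`.

    For each attribute (column), build equal-width histograms of the
    full set and the sample over the same bin edges, normalize them to
    relative frequencies, and sum the absolute differences. Lower is
    better; 0 means the sample's bin frequencies match exactly.
    (Sketch only; the article's exact distance measure may differ.)
    """
    total = 0.0
    n_attrs = data.shape[1]
    for j in range(n_attrs):
        lo, hi = data[:, j].min(), data[:, j].max()
        h_full, _ = np.histogram(data[:, j], bins=bins, range=(lo, hi))
        h_samp, _ = np.histogram(sample[:, j], bins=bins, range=(lo, hi))
        # Normalize counts so that sets of different sizes are comparable.
        p = h_full / len(data)
        q = h_samp / len(sample)
        total += np.abs(p - q).sum()
    return total / n_attrs
```

A sampler built on this score could, for instance, compare several candidate random samples and keep the one with the smallest distance, which is one simple way a histogram criterion can beat a single blind SRS draw at small sample sizes.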
Index Terms
Efficient sampling of training set in large and noisy multimedia data