Abstract
We propose a methodology for building a robust query classification system that can identify thousands of query classes, while dealing in real time with the query volume of a commercial Web search engine. We use a pseudo relevance feedback technique: given a query, we determine its topic by classifying the Web search results retrieved by the query. Motivated by the needs of search advertising, we primarily focus on rare queries, which are the hardest from the point of view of machine learning, yet in aggregate account for a considerable fraction of search engine traffic. Empirical evaluation confirms that our methodology yields a considerably higher classification accuracy than previously reported. We believe that the proposed methodology will lead to better matching of online ads to rare queries and overall to a better user experience.
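The pseudo-relevance-feedback idea described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the `search` and `classify_doc` callables, the toy index, the keyword lists, and the choice to aggregate by summing per-document class scores are all assumptions introduced here for demonstration.

```python
from collections import Counter
from typing import Callable, Dict, List

Scores = Dict[str, float]

def classify_query(
    query: str,
    search: Callable[[str], List[str]],      # hypothetical: query -> result texts
    classify_doc: Callable[[str], Scores],   # hypothetical: text -> class scores
    top_k: int = 10,
    top_classes: int = 3,
) -> List[str]:
    """Label a query by classifying its top search results and pooling the scores."""
    totals: Counter = Counter()
    for doc in search(query)[:top_k]:
        for label, score in classify_doc(doc).items():
            # Assumed aggregation rule: plain sum of per-document scores.
            totals[label] += score
    return [label for label, _ in totals.most_common(top_classes)]

# Toy stand-ins so the sketch runs without a search engine or a trained model.
TOY_INDEX = {
    "xj220 top speed": [
        "jaguar xj220 sports car engine review",
        "fastest production car speed record",
        "car engine horsepower specs",
    ],
}
TOY_KEYWORDS = {
    "autos": {"car", "engine", "horsepower"},
    "animals": {"cat", "wildlife", "jaguar"},
}

def toy_search(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])

def toy_classify_doc(doc: str) -> Scores:
    # Score each class by the number of its keywords present in the document.
    words = set(doc.split())
    return {label: float(len(words & kws)) for label, kws in TOY_KEYWORDS.items()}

if __name__ == "__main__":
    print(classify_query("xj220 top speed", toy_search, toy_classify_doc))
```

The point of the sketch is that a short, rare query carries too little text to classify directly, whereas the retrieved pages supply enough context for a document classifier; in the toy data the query itself contains no class keyword, yet the pooled result scores favor the automotive class.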