research-article

Classifying search queries using the Web as a source of knowledge

Published: 30 April 2009

Abstract

We propose a methodology for building a robust query classification system that can identify thousands of query classes, while dealing in real time with the query volume of a commercial Web search engine. We use a pseudo relevance feedback technique: given a query, we determine its topic by classifying the Web search results retrieved by the query. Motivated by the needs of search advertising, we primarily focus on rare queries, which are the hardest from the point of view of machine learning, yet in aggregate account for a considerable fraction of search engine traffic. Empirical evaluation confirms that our methodology yields a considerably higher classification accuracy than previously reported. We believe that the proposed methodology will lead to better matching of online ads to rare queries and overall to a better user experience.
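
The idea in the abstract, classifying a query by classifying the search results it retrieves, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the two-class toy taxonomy, the canned `FAKE_INDEX` standing in for a search engine, the keyword-count document classifier, and the choice to aggregate by summing per-document scores. The actual system described by the authors uses a taxonomy with thousands of classes and its own classifier and aggregation scheme.

```python
import re
from collections import Counter

# Hypothetical toy taxonomy: class label -> indicative keywords (assumption).
TAXONOMY = {
    "Travel/Flights": {"flight", "airline", "airfare", "airport"},
    "Electronics/Cameras": {"camera", "lens", "megapixel", "dslr"},
}

# Hypothetical stand-in for a search engine: query -> top-ranked result texts.
FAKE_INDEX = {
    "cheap tix to lax": [
        "Compare airfare and book a flight to Los Angeles airport",
        "Airline deals: low-cost flight tickets",
        "Best camera for travel photos",
    ],
}


def search(query, k=10):
    """Return up to k result texts for the query (canned data, not a real API)."""
    return FAKE_INDEX.get(query, [])[:k]


def classify_document(text):
    """Score each class by how many of its keywords occur in the text."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        label: sum(tokens[word] for word in keywords)
        for label, keywords in TAXONOMY.items()
    }


def classify_query(query, k=10):
    """Classify a query by summing class scores over its top-k search results.

    All retrieved results are treated as relevant (pseudo relevance feedback).
    Returns (label, score) pairs sorted by descending score; classes with a
    zero total are dropped.
    """
    totals = Counter()
    for doc in search(query, k):
        totals.update(classify_document(doc))
    return [(label, score) for label, score in totals.most_common() if score > 0]


if __name__ == "__main__":
    print(classify_query("cheap tix to lax"))
```

The query itself shares no words with either class; the label comes entirely from the retrieved texts, which is why the approach suits short, rare queries that offer little evidence on their own.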

          Reviews

          Aris Gkoulalas-Divanis

          With billions of pages currently available on the World Wide Web (WWW), effectively searching the Web to discover the pages relevant to user-supplied queries is of paramount importance. At the same time, the tremendous growth of the WWW has given birth to a multi-billion dollar online advertising industry. Both searching the Web and matching advertisements to user-supplied queries require sophisticated algorithms that use external knowledge to drive the search process. Gabrilovich et al. propose a query classification methodology. First, the top-ranked search results for the user-supplied query are found; these are all assumed to be relevant to the query. Then, the Web pages corresponding to the returned uniform resource locators (URLs) are retrieved and assigned, by a document classifier, to the most relevant node of a pre-specified commercial taxonomy of Web queries that consists of approximately 6,000 nodes. As the authors indicate, this taxonomy is two orders of magnitude larger than the ones used in previous studies. Finally, the classifications of the retrieved pages are used to classify the original user-supplied query, resulting in a highly accurate classification. Applying the proposed methodology can significantly improve the search results returned by Web search engines and provide more focused Web advertisements. Overall, this paper is both interesting and promising, as it opens a series of research directions that deserve further investigation. I recommend this paper to all researchers in the Web information retrieval (IR) field.

          Online Computing Reviews Service

          • Published in

            ACM Transactions on the Web  Volume 3, Issue 2
            April 2009
            98 pages
            ISSN: 1559-1131
            EISSN: 1559-114X
            DOI: 10.1145/1513876

            Copyright © 2009 ACM

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 30 April 2009
            • Accepted: 1 February 2009
            • Revised: 1 August 2008
            • Received: 1 February 2008

            Qualifiers

            • research-article
            • Research
            • Refereed
