
Individual Judgments Versus Consensus: Estimating Query-URL Relevance

Published: 09 January 2016

Abstract

Query-URL relevance, which measures the relevance of each retrieved URL with respect to a given query, is one of the fundamental criteria for evaluating the performance of commercial search engines. The traditional way to collect reliable and accurate query-URL relevance requires multiple annotators to provide individual judgments based on their subjective expertise (e.g., their understanding of user intents). In this setting, the subjectivity reflected in each annotator individual judgment (AIJ) inevitably affects the quality of the ground truth relevance (GTR). To the best of our knowledge, however, the potential impact of AIJs on estimating GTRs has not been studied or exploited quantitatively in existing work. This article first studies how multiple AIJs and GTRs are correlated. Our empirical studies find that multiple AIJs can provide additional cues for improving the accuracy of estimating GTRs. Inspired by this finding, we propose a novel approach that integrates the multiple AIJs with the features characterizing query-URL pairs to estimate GTRs more accurately. Furthermore, we conduct experiments on a commercial search engine, Baidu.com, and report significant gains in terms of normalized discounted cumulative gain.
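The reported gains are measured in normalized discounted cumulative gain (NDCG). As background, here is a minimal sketch of the metric using the common graded-gain form 2^rel − 1 with a log2 rank discount; the article does not state its exact gain or truncation settings, so those choices are assumptions.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: graded gain 2^rel - 1, discounted by
    # log2(rank + 2) so the top-ranked result (rank 0) is discounted by log2(2) = 1.
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    # Normalize by the DCG of the ideal (descending-relevance) ordering,
    # optionally truncating both lists at cutoff k (NDCG@k).
    k = k or len(relevances)
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0
```

An already-ideal ranking such as `[3, 2, 1, 0]` scores 1.0, while any misordering scores strictly less, which is what makes the metric suitable for comparing rankers trained on differently estimated GTRs.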



Published in

ACM Transactions on the Web, Volume 10, Issue 1
February 2016, 198 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/2870642

Copyright © 2016 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 9 January 2016
• Accepted: 1 October 2015
• Revised: 1 September 2015
• Received: 1 May 2013

      Qualifiers

      • research-article
      • Research
      • Refereed