Individual Judgments Versus Consensus: Estimating Query-URL Relevance

Abstract
Query-URL relevance, the degree to which each retrieved URL satisfies a given query, is one of the fundamental criteria for evaluating the performance of commercial search engines. The traditional way to collect reliable and accurate query-URL relevance labels requires multiple annotators to provide individual judgments based on their subjective expertise (e.g., their understanding of user intents). The annotators' subjectivity, reflected in each annotator individual judgment (AIJ), inevitably affects the quality of the resulting ground truth relevance (GTR). To the best of our knowledge, however, the potential impact of AIJs on estimating GTRs has not been studied or exploited quantitatively in existing work. This article first studies how multiple AIJs and GTRs are correlated. Our empirical studies find that multiple AIJs can provide additional cues for improving the accuracy of GTR estimation. Inspired by this finding, we then propose a novel approach that integrates the multiple AIJs with features characterizing query-URL pairs to estimate GTRs more accurately. Furthermore, we conduct experiments on a commercial search engine, Baidu.com, and report significant gains in terms of normalized discounted cumulative gain.
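The evaluation metric reported above, normalized discounted cumulative gain (NDCG), can be sketched as follows. This is a minimal illustration using the common exponential-gain form (gain 2^rel - 1, discount log2(rank + 1)), not the authors' exact evaluation code.

```python
import math

def dcg(rels):
    # Discounted cumulative gain over a ranked list of graded
    # relevance labels (rank positions start at 1, so the discount
    # at 0-based index i is log2(i + 2)).
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels):
    # Normalize by the DCG of the ideal (descending) ordering, so a
    # perfectly ordered list scores 1.0.
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Example: graded labels (0-3) of retrieved URLs in ranked order.
score = ndcg([3, 2, 3, 0, 1, 2])
```

A ranking that places every highly relevant URL first scores 1.0; misordered relevant URLs lower the score toward 0.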