Abstract
In this article we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year, we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log influence the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
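The distinctions drawn in the abstract (static vs. dynamic caching, answers vs. posting lists, hit rate as the measure) can be illustrated with a minimal sketch. The abstract does not describe the proposed algorithm, so everything below is an assumption made for illustration: the function names are hypothetical, the static answer cache simply keeps the most frequent past queries, the dynamic cache is plain LRU, and the posting-list selection is a generic greedy knapsack heuristic that ranks terms by query frequency per unit of list length.

```python
from collections import Counter, OrderedDict


def static_answer_cache(train_queries, capacity):
    """Static cache: keep the `capacity` most frequent queries of a past log."""
    counts = Counter(train_queries)
    return {q for q, _ in counts.most_common(capacity)}


def static_hit_rate(cache, test_queries):
    """Fraction of test queries answered from a fixed (never updated) cache."""
    hits = sum(1 for q in test_queries if q in cache)
    return hits / len(test_queries)


def lru_hit_rate(test_queries, capacity):
    """Fraction of test queries answered from a dynamic LRU cache."""
    cache = OrderedDict()
    hits = 0
    for q in test_queries:
        if q in cache:
            hits += 1
            cache.move_to_end(q)           # mark as most recently used
        else:
            cache[q] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(test_queries)


def static_posting_cache(term_query_freq, posting_len, budget):
    """Assumed heuristic: greedily fill `budget` with the terms that have
    the highest query frequency per unit of posting-list length."""
    ranked = sorted(term_query_freq,
                    key=lambda t: term_query_freq[t] / posting_len[t],
                    reverse=True)
    chosen, used = set(), 0
    for t in ranked:
        if used + posting_len[t] <= budget:
            chosen.add(t)
            used += posting_len[t]
    return chosen


# Toy data, invented for illustration only.
train = ["a b", "a b", "c", "a b", "d"]
test = ["a b", "c", "e", "a b"]
answers = static_answer_cache(train, capacity=1)
static_rate = static_hit_rate(answers, test)
lru_rate = lru_hit_rate(test, capacity=3)

freq = {"the": 100, "cache": 40, "zebra": 5}
length = {"the": 1000, "cache": 50, "zebra": 5}
terms = static_posting_cache(freq, length, budget=60)
```

In this toy setting a very frequent term with a very long posting list ("the") is left out because it would consume the whole budget, which is the kind of trade-off a size-aware static policy has to weigh. Real query logs and list sizes would of course behave differently.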
Index Terms
Design trade-offs for search engine caching