Abstract
Consider a database of time-series, where each datapoint in the series records the total number of users who asked for a specific query at an internet search engine. Storage and analysis of such logs can be very beneficial for a search company from multiple perspectives. First, from a data organization perspective, because query Weblogs capture important trends and statistics, they can help enhance and optimize the search experience (keyword recommendation, discovery of news events). Second, Weblog data can provide an important polling mechanism for the microeconomic aspects of a search engine, since they can facilitate and promote the advertising facet of the search engine (understand what users request and when they request it).
Due to the sheer amount of time-series Weblogs, manipulation of the logs in a compressed form is an impeding necessity for fast data processing and compact storage requirements. Here, we explicate how to compute the lower and upper distance bounds on the time-series logs when working directly on their compressed form. Optimal distance estimation means tighter bounds, leading to better candidate selection/elimination and ultimately faster search performance. Our derivation of the optimal distance bounds is based on the careful analysis of the problem using optimization principles. The experimental evaluation suggests a clear performance advantage of the proposed method, compared to previous compression/search techniques. The presented method results in a 10--30% improvement on distance estimations, which in turn leads to 25--80% improvement on the search performance.
Supplemental Material
Available for Download
Online appendix to Optimal distance bounds for fast search on compressed time-series query logs on article 6.
- Adar, E., Weld, D., Bershad, B., and Gribble, S. 2007. Why we search: Visualizing and predicting user behavior. In Proceedings of the International World Wide Web Conference. Google Scholar
Digital Library
- Agrawal, R., Faloutsos, C., and Swami, A. 1993. Efficient similarity search in sequence databases. In Proceedings of the International Conference on Foundations of Data Organization and Algorithms (FODO). 69--84. Google Scholar
Digital Library
- Assent, I., Krieger, R., Afschari, F., and Seidl, T. 2008. The TS-tree: Efficient time series search and retrieval. In Proceedings of the International Conference on Extending Database Technology (EDBT). 252--263. Google Scholar
Digital Library
- Beitzel, S., Jensen, E., Chowdhury, A., Frieder, O., and Grossman, D. 2007. Temporal analysis of a very large topically categorized Web query log. In J. Amer. Soc. Inform. Sci. Tech. 58, 2, 166--178. Google Scholar
Digital Library
- Bertsekas, D. P. 2000. Nonlinear Programming. Athena Scientific.Google Scholar
- Bollegala, D., Matsuo, Y., and Ishizuka, M. 2007. Measuring semantic similarity between words using Web search engines. In Proceedings of the International World Wide Web Conference. 757--766. Google Scholar
Digital Library
- Bozkaya, T. and Özsoyoglu, M. 1997. Distance-based indexing for high-dimensional metric spaces. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 357--368. Google Scholar
Digital Library
- Cai, Y. and Ng, R. 2004. Indexing spatio-temporal trajectories with chebyshev polynomials. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 599--610. Google Scholar
Digital Library
- Chee Fu, A. W., Chan, P. M., Cheung, Y.-L., and Moon, Y. 2000. Dynamic VP-Tree indexing for N-nearest neighbor search given pair-wise distances. J. VLDB, 154--173. Google Scholar
Digital Library
- Chen, Q., Li, M., and Zhou, M. 2007. Improving query spelling correction using Web search results. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language. 181--189.Google Scholar
- Chien, S. and Immorlica, N. 2005. Semantic similarity between search engine queries using temporal correlation. In Proceedings of the International World Wide Web Conference. 2--11. Google Scholar
Digital Library
- Cover, T. M. and Thomas, J. A. 1991. Elements of Information Theory. John Wiley and Sons. Google Scholar
Digital Library
- Dubinko, M., Kumar, R., Magnani, J., Novak, J., Raghavan, P., and Tomkins, A. 2006. Visualizing tags over time. In Proceedings of the International World Wide Web Conference. 193--202. Google Scholar
Digital Library
- Guttman, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 47--57. Google Scholar
Digital Library
- Hall, E. A. M. C. K. 2009. Gazpacho and summer rash: Lexical relationships from temporal patterns of web search queries. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 1046--1055. Google Scholar
Digital Library
- Kage, T. and Sumiya, K. 2006. A Web search method based on the temporal relation of query keywords. In Proceedings of the Web Information Systems. 4--15. Google Scholar
Digital Library
- Kahveci, T. and Singh, A. K. 2001. Variable length queries for time series data. In Proceedings of the IEEE International Conference on Data Engineering. 273--282. Google Scholar
Digital Library
- Keogh, E. 2002. Exact indexing of dynamic time warping. In Proceedings of the International Conference on Very Large Databases. 406--417. Google Scholar
Digital Library
- Keogh, E., Chakrabarti, K., Mehrotra, S., and Pazzani, M. 2001. Locally adaptive dimensionality reduction for indexing large time series databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 151--162. Google Scholar
Digital Library
- Keogh, E. J. and Kasetty, S. 2003. On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Min. Knowl. Discov. 7, 4, 349--371. Google Scholar
Digital Library
- Kleinberg, J. 2002. Bursty and hierarchical structure in streams. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 91--101. Google Scholar
Digital Library
- Lie, B., Jones, R., and Klinkner, K. 2006. Measuring the meaning in time series clustering of text search queries. In Proceedings of the International Conference on Information and Knowledge Management. 836--837. Google Scholar
Digital Library
- Liu, N., Nong, S., Yan, J., Zhang, B., Chen, Z., and Li, Y. 2006. Similarity of temporal query logs based on ARIMA model. In Proceedings of the IEEE International Conference on Data Mining. 975--979. Google Scholar
Digital Library
- Oppenheim, A., Willsky, A., and Nawab, S. 1997. Signals and Systems, 2nd Ed. Prentice Hall. Google Scholar
Digital Library
- Otsuka, S. and Kitsuregawa, M. 2006. Clustering of search engine keywords using access logs. In Proceedings of the International Conference on Database and Expert Systems Applications. 842--852. Google Scholar
Digital Library
- Otsuka, S., Toyoda, M., Hirai, J., and Kitsuregawa, M. 2004. Extracting user behavior by Web communities technology on global Web logs. In Proceedings of the International Conference on Database and Expert Systems Applications. 957--988.Google Scholar
- Rafiei, D. and Mendelzon, A. 1998. Efficient retrieval of similar time sequences using DFT. In Proceedings of the International Conference on Foundations of Data Organization and Algorithms. 249--257.Google Scholar
- Rattenbury, T., Good, N., and Naaman, M. 2007. Towards extracting flickr tag semantics. Proceedings of the International World Wide Web Conference. 1287--1288. Google Scholar
Digital Library
- Richardson, M. 2008. Learning about the world through long-term query logs. ACM Trans. Web, 2, 4. Google Scholar
Digital Library
- Rode, H. and Hiemstra, D. 2006. Using query profiles for clarification. In Proceedings of the European Conference on Information Retrieval. 205--216. Google Scholar
Digital Library
- Sanderson, M. and Dumais, S. 2007. Examining repetition in user search behavior. In Proceedings of the European Conference on Information Retrieval. 597--604. Google Scholar
Digital Library
- Sun, Y., Liu, N., Xie, K., Yan, S., Zhang, B., and Chen, Z. 2007. Causal relation of queries from temporal logs. In Proceedings of the International World Wide Web Conference. 1141--142. Google Scholar
Digital Library
- Vlachos, M., Meek, C., Vagena, Z., and Gunopulos, D. 2004. Identification of similarities, periodicities & bursts for online search queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 131--142. Google Scholar
Digital Library
- Vlachos, M., Yu, P., and Castelli, V. 2005. On Periodicity detection and structural periodic similarity. In Proceedings of the SIAM International Conference on Data Mining.Google Scholar
- Wang, C. and Wang, X. S. 2000. Multilevel filtering for high dimensional nearest neighbor search. In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.Google Scholar
- Wang, X., Zhai, C., Hu, X., and Sproat, R. 2007. Mining correlated bursty topic patterns from coordinated text streams. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 784--793. Google Scholar
Digital Library
- Weber, R., Schek, H.-J., and Blott, S. 1998. A quantitative analysis and performance study for similarity search methods in high-dimensional spaces. In Proceedings of the International Conference on Very Large Database. 194--205. Google Scholar
Digital Library
- Yi, B. and Faloutsos, C. 2000. Fast time sequence indexing for arbitrary Lp norms. In Proceedings of the International Conference on Very Large Database. 385--394. Google Scholar
Digital Library
- Yianilos, P. 1992. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithm. 311--321. Google Scholar
Digital Library
- Zhao, Q., Hoi, S., Liu, T.-Y., Bhowmick, S. S., Lyu, M. R., and Ma, W.-Y. 2006. Time-dependent semantic similarity measure of queries using historical click-through data. In Proceedings of the International World Wide Web Conference. 543--552. Google Scholar
Digital Library
- Ziegler, C.-N., Simon, K., and Lausen, G. 2006. Automatic computation of semantic proximity using taxonomic knowledge. In Proceedings of the International Conference on Information and Knowledge Management. 465--474. Google Scholar
Digital Library
Index Terms
Optimal distance bounds for fast search on compressed time-series query logs
Recommendations
Re-ranking search results using query logs
CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge managementThis work addresses two common problems in search, frequently occurring with underspecified user queries: the top-ranked results for such queries may not contain documents relevant to the user's search intent, and fresh and relevant pages may not get ...
Mining query subtopics from search log data
SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrievalMost queries in web search are ambiguous and multifaceted. Identifying the major senses and facets of queries from search log data, referred to as query subtopic mining in this paper, is a very important issue in web search. Through search log analysis, ...
Time series analysis of a Web search engine transaction log
In this paper, we use time series analysis to evaluate predictive scenarios using search engine transactional logs. Our goal is to develop models for the analysis of searchers' behaviors over time and investigate if time series analysis is a valid ...






Comments