ABSTRACT
In this paper, we propose a novel unsupervised approach to query segmentation, an important task in Web search. We use a generative query model to recover the underlying concepts that make up a query's original segmented form. The model's parameters are estimated with an expectation-maximization (EM) algorithm that optimizes the minimum description length objective function on a partial corpus specific to the query. To augment this unsupervised learning, we incorporate evidence from Wikipedia.
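As a rough illustration of the decoding step only, the following minimal sketch picks the most probable segmentation of a query by dynamic programming. It assumes that concepts are generated independently of one another (a unigram assumption the abstract does not state), and the names `best_segmentation`, `concept_prob`, `max_len`, and `unseen`, along with all probability values, are hypothetical. EM estimation, the minimum description length objective, and the Wikipedia evidence are not shown.

```python
import math

def best_segmentation(words, concept_prob, max_len=4, unseen=1e-9):
    """Return the most probable split of `words` into concepts.

    Assumes each concept is drawn independently, so the probability of a
    segmentation is the product of its concept probabilities.
    """
    n = len(words)
    # best[i] holds (log-probability, segments) for the best split of words[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            concept = " ".join(words[start:end])
            # Unknown single words get a small floor probability;
            # unknown multi-word concepts are not allowed.
            floor = unseen if end - start == 1 else 0.0
            p = concept_prob.get(concept, floor)
            if p <= 0.0:
                continue
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [concept])
    return best[n][1]

# Made-up concept probabilities, for illustration only.
toy_probs = {
    "new": 0.02, "york": 0.01, "times": 0.015, "subscription": 0.005,
    "new york": 0.008, "new york times": 0.004, "york times": 0.0001,
}
segments = best_segmentation("new york times subscription".split(), toy_probs)
print(segments)
```

With these toy values the three-word concept outweighs any split of it, so the query is segmented as "new york times" followed by "subscription".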
Experiments show that our approach dramatically improves performance over the traditional approach based on mutual information, and produces results comparable to those of a supervised method. In particular, the basic generative language model contributes a 7.4% improvement over the mutual-information-based method (measured by segment F1 on the Intersection test set). EM optimization further improves performance by 14.3%. Additional knowledge from Wikipedia provides another improvement of 24.3%, adding up to a total improvement of 46% (from 0.530 to 0.774).
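The reported figures can be checked with a few lines of arithmetic. One reading, which the abstract does not spell out, is that each step's gain is expressed relative to the mutual-information baseline, so the three percentages add up directly. The variable names below are illustrative.

```python
# Segment F1 values and per-step gains as quoted in the abstract.
baseline_f1 = 0.530
final_f1 = 0.774
step_gains_pct = [7.4, 14.3, 24.3]

# Total improvement relative to the baseline, in percent.
total_relative_pct = (final_f1 - baseline_f1) / baseline_f1 * 100

# Sum of the per-step gains, assuming all are relative to the same baseline.
summed_steps_pct = sum(step_gains_pct)

print(round(total_relative_pct, 1), round(summed_steps_pct, 1))
```

Both quantities come out at roughly 46, which is consistent with the total stated in the abstract under that reading.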
Index Terms
Unsupervised query segmentation using generative language models and wikipedia