DOI: 10.1145/1367497.1367545
research-article

Unsupervised query segmentation using generative language models and Wikipedia

Published: 21 April 2008

ABSTRACT

In this paper, we propose a novel unsupervised approach to query segmentation, an important task in Web search. We use a generative query model to recover a query's underlying concepts that compose its original segmented form. The model's parameters are estimated using an expectation-maximization (EM) algorithm, optimizing the minimum description length objective function on a partial corpus that is specific to the query. To augment this unsupervised learning, we incorporate evidence from Wikipedia.
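To make the idea of recovering a query's concepts concrete, here is a minimal sketch of the decoding step only. It assumes a unigram model over concepts (each segment is drawn independently) and picks the split with the highest total log-probability by dynamic programming. The function name, the `max_len` and `unseen_prob` parameters, and the toy probabilities are illustrative assumptions; the abstract does not specify them, and the EM estimation, the minimum description length objective, and the Wikipedia evidence are not reproduced here.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical concept probabilities, hand-set for illustration only.
# In the paper these would be estimated by EM; here they are made up.
TOY_PROBS: Dict[str, float] = {
    "new": 0.05,
    "york": 0.001,
    "times": 0.03,
    "subscription": 0.01,
    "new york": 0.02,
    "new york times": 0.015,
}


def best_segmentation(
    words: List[str],
    concept_prob: Dict[str, float],
    max_len: int = 4,           # assumed cap on words per segment
    unseen_prob: float = 1e-9,  # assumed floor for unlisted segments
) -> Tuple[List[str], float]:
    """Return the segmentation with the highest summed log-probability."""
    n = len(words)
    best = [-math.inf] * (n + 1)  # best[i]: best score for words[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last segment
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            segment = " ".join(words[start:end])
            prob = concept_prob.get(segment, unseen_prob)
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Walk the back-pointers from the end of the query to the start.
    segments: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(" ".join(words[start:end]))
        end = start
    segments.reverse()
    return segments, best[n]


segments, log_prob = best_segmentation(
    "new york times subscription".split(), TOY_PROBS
)
```

With these toy numbers the three-word concept outweighs the product of its parts, so the query splits into "new york times" and "subscription". Different probabilities would give a different split, which is why the parameter estimation described in the abstract matters.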

Experiments show that our approach dramatically improves performance over the traditional approach that is based on mutual information, and produces comparable results with a supervised method. In particular, the basic generative language model contributes a 7.4% improvement over the mutual information based method (measured by segment F1 on the Intersection test set). EM optimization further improves the performance by 14.3%. Additional knowledge from Wikipedia provides another improvement of 24.3%, adding up to a total of 46% improvement (from 0.530 to 0.774).
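The reported percentages can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the abstract. The variable names are illustrative, and the reading that each step's gain is expressed relative to the mutual information baseline (rather than compounding on the previous step) is an inference from the arithmetic, not something the abstract states.

```python
# Figures quoted in the abstract (segment F1 on the Intersection test set).
baseline_f1 = 0.530                  # mutual information based method
final_f1 = 0.774                     # full approach with EM and Wikipedia
step_gains_pct = [7.4, 14.3, 24.3]   # language model, EM, Wikipedia

# Total relative improvement implied by the two F1 scores.
total_relative_gain_pct = (final_f1 - baseline_f1) / baseline_f1 * 100

# Reading 1 (assumed): each step is a percentage of the baseline,
# so the steps simply add up.
sum_of_steps_pct = sum(step_gains_pct)

# Reading 2 (assumed alternative): each step compounds on the previous one.
compounded_f1 = baseline_f1
for gain in step_gains_pct:
    compounded_f1 *= 1 + gain / 100

print(f"total relative gain: {total_relative_gain_pct:.1f}%")
print(f"sum of step gains:   {sum_of_steps_pct:.1f}%")
print(f"compounded F1:       {compounded_f1:.3f}")
```

The additive reading reproduces the stated 46% total, while compounding the three steps would overshoot the reported final score of 0.774.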

Published in

WWW '08: Proceedings of the 17th international conference on World Wide Web
April 2008
1326 pages
ISBN: 9781605580852
DOI: 10.1145/1367497

      Copyright © 2008 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 1,899 of 8,196 submissions, 23%
