Text categorization by boosting automatically extracted concepts
AUTHORS
Lijuan Cai
Thomas Hofmann
CITED BY
32 Citations
PUBLICATION
| Title | SIGIR '03 Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval |
| General Chairs | Charles Clarke, University of Waterloo, Canada; Gordon Cormack, University of Waterloo, Canada |
| Program Chairs | Jamie Callan, Carnegie Mellon University, Pittsburgh, PA; David Hawking, Australian National University, Australia; Alan Smeaton, Dublin City University, Ireland |
| Pages | 182-189 |
| Publication Date | 2003-07-28 (yyyy-mm-dd) |
| Sponsor | SIGIR: ACM Special Interest Group on Information Retrieval |
| Publisher | ACM, New York, NY, USA ©2003 |
| ISBN | 1-58113-646-3 |
| Order Number | 534032 |
| DOI | 10.1145/860435.860470 |
| Conference | IR: Research and Development in Information Retrieval |
| Paper Acceptance Rate | 46 of 266 submissions, 17% |
| Overall Acceptance Rate | 1,201 of 6,327 submissions, 19% |
Table of Contents
| Keynote Address - exploring, modeling, and using the web graph | ||
| Andrei Broder | ||
| Pages: 1-1 | ||
| doi>10.1145/860435.860436 | ||
The Web graph, meaning the graph induced by Web pages as nodes and their hyperlinks as directed edges, has become a fascinating object of study for many people: physicists, sociologists, mathematicians, computer scientists, and information retrieval ...
| Salton Award Lecture - Information retrieval and computer science: an evolving relationship | ||
| W. Bruce Croft | ||
| Pages: 2-3 | ||
| doi>10.1145/860435.860437 | ||
Following the tradition of these acceptance talks, I will be giving my thoughts on where our field is going. Any discussion of the future of information retrieval (IR) research, however, needs to be placed in the context of its history and relationship ...
| SESSION: Retrieval models | ||
| Bayesian extension to the language model for ad hoc information retrieval | ||
| Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping | ||
| Pages: 4-9 | ||
| doi>10.1145/860435.860439 | ||
We propose a Bayesian extension to the ad-hoc Language Model. Many smoothed estimators used for the multinomial query model in ad-hoc Language Models (including Laplace and Bayes-smoothing) are approximations to the Bayesian predictive distribution. ...
| Beyond independent relevance: methods and evaluation metrics for subtopic retrieval | ||
| Cheng Xiang Zhai, William W. Cohen, John Lafferty | ||
| Pages: 10-17 | ||
| doi>10.1145/860435.860440 | ||
We present a non-traditional retrieval problem we call subtopic retrieval. The subtopic retrieval problem is concerned with finding documents that cover many different subtopics of a query topic. In such a problem, the utility of a document in ...
| Empirical development of an exponential probabilistic model for text retrieval: using textual analysis to build a better model | ||
| Jaime Teevan, David R. Karger | ||
| Pages: 18-25 | ||
| doi>10.1145/860435.860441 | ||
Much work in information retrieval focuses on using a model of documents and queries to derive retrieval algorithms. Model based development is a useful alternative to heuristic development because in a model the assumptions are explicit and can be examined ...
| SESSION: Question answering | ||
| Question classification using support vector machines | ||
| Dell Zhang, Wee Sun Lee | ||
| Pages: 26-32 | ||
| doi>10.1145/860435.860443 | ||
Question classification is very important for question answering. This paper presents our research work on automatic question classification through machine learning approaches. We have experimented with five machine learning algorithms: Nearest Neighbors ...
| Structured use of external knowledge for event-based open domain question answering | ||
| Hui Yang, Tat-Seng Chua, Shuguang Wang, Chun-Keat Koh | ||
| Pages: 33-40 | ||
| doi>10.1145/860435.860444 | ||
One of the major problems in question answering (QA) is that the queries are either too brief or often do not contain most relevant terms in the target corpus. In order to overcome this problem, our earlier work integrates external knowledge extracted ...
| Quantitative evaluation of passage retrieval algorithms for question answering | ||
| Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes, Gregory Marton | ||
| Pages: 41-47 | ||
| doi>10.1145/860435.860445 | ||
Passage retrieval is an important component common to many question answering systems. Because most evaluations of question answering systems focus on end-to-end performance, comparison of common components becomes difficult. To address this shortcoming, ...
| SESSION: Web | ||
| Building a web thesaurus from web link structure | ||
| Zheng Chen, Shengping Liu, Liu Wenyin, Geguang Pu, Wei-Ying Ma | ||
| Pages: 48-55 | ||
| doi>10.1145/860435.860447 | ||
Thesaurus has been widely used in many applications, including information retrieval, natural language processing, and question answering. In this paper, we propose a novel approach to automatically constructing a domain-specific thesaurus from the Web ...
| Implicit link analysis for small web search | ||
| Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen, Wei-Ying Ma, Hong-Jiang Zhang, Chao-Jun Lu | ||
| Pages: 56-63 | ||
| doi>10.1145/860435.860448 | ||
Current Web search engines generally impose link analysis-based re-ranking on web-page retrieval. However, the same techniques, when applied directly to small web search such as intranet and site search, cannot achieve the same performance because their ...
| Query type classification for web document retrieval | ||
| In-Ho Kang, GilChang Kim | ||
| Pages: 64-71 | ||
| doi>10.1145/860435.860449 | ||
The heterogeneous Web exacerbates IR problems and short user queries make them worse. The contents of web documents are not enough to find good answer documents. Link information and URL information compensates for the insufficiencies of content information. ...
| SESSION: Human interaction | ||
| Stuff I've seen: a system for personal information retrieval and re-use | ||
| Susan Dumais, Edward Cutrell, JJ Cadiz, Gavin Jancke, Raman Sarin, Daniel C. Robbins | ||
| Pages: 72-79 | ||
| doi>10.1145/860435.860451 | ||
Most information retrieval technologies are designed to facilitate information discovery. However, much knowledge work involves finding and re-using previously seen information. We describe the design and evaluation of a system, called Stuff I've ...
| Search strategies in content-based image retrieval | ||
| Sharon McDonald, John Tait | ||
| Pages: 80-87 | ||
| doi>10.1145/860435.860452 | ||
This paper describes two studies that looked at users' ability to formulate visual queries with a Content-Based Image Retrieval system that uses dominant image colour as the primary indexing key. The first experiment examined users' performance with ...
| Using terminological feedback for web search refinement: a log-based study | ||
| Peter Anick | ||
| Pages: 88-95 | ||
| doi>10.1145/860435.860453 | ||
Although interactive query reformulation has been actively studied in the laboratory, little is known about the actual behavior of web searchers who are offered terminological feedback along with their search results. We analyze log sessions for two ...
| SESSION: Text categorization | ||
| A scalability analysis of classifiers in text categorization | ||
| Yiming Yang, Jian Zhang, Bryan Kisiel | ||
| Pages: 96-103 | ||
| doi>10.1145/860435.860455 | ||
Real-world applications of text categorization often require a system to deal with tens of thousands of categories defined over a large taxonomy. This paper addresses the problem with respect to a set of popular algorithms in text categorization, including ...
| A repetition based measure for verification of text collections and for text categorization | ||
| Dmitry V. Khmelev, William J. Teahan | ||
| Pages: 104-110 | ||
| doi>10.1145/860435.860456 | ||
We suggest a way for locating duplicates and plagiarisms in a text collection using an R-measure, which is the normalized sum of the lengths of all suffixes of the text repeated in other documents of the collection. The R-measure can be effectively ...
| Using asymmetric distributions to improve text classifier probability estimates | ||
| Paul N. Bennett | ||
| Pages: 111-118 | ||
| doi>10.1145/860435.860457 | ||
Text classifiers that give probability estimates are more readily applicable in a variety of scenarios. For example, rather than choosing one set decision threshold, they can be used in a Bayesian risk model to issue a run-time decision which minimizes ...
| SESSION: Multimedia information retrieval | ||
| Automatic image annotation and retrieval using cross-media relevance models | ||
| J. Jeon, V. Lavrenko, R. Manmatha | ||
| Pages: 119-126 | ||
| doi>10.1145/860435.860459 | ||
Libraries have traditionally used manual image annotation for indexing and then later retrieving their image collections. However, manual image annotation is an expensive and labor intensive procedure and hence there has been great interest in coming ...
| Modeling annotated data | ||
| David M. Blei, Michael I. Jordan | ||
| Pages: 127-134 | ||
| doi>10.1145/860435.860460 | ||
We consider the problem of modeling annotated data---data with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). We describe three hierarchical probabilistic mixture models ...
| Experimental result analysis for a generative probabilistic image retrieval model | ||
| Thijs Westerveld, Arjen P. de Vries | ||
| Pages: 135-142 | ||
| doi>10.1145/860435.860461 | ||
The main conclusion from the metrics-based evaluation of video retrieval systems at TREC's video track is that non-interactive image retrieval from general collections using visual information only is not yet feasible. We show how a detailed analysis ...
| SESSION: Structured documents | ||
| Combining document representations for known-item search | ||
| Paul Ogilvie, Jamie Callan | ||
| Pages: 143-150 | ||
| doi>10.1145/860435.860463 | ||
This paper investigates the pre-conditions for successful combination of document representations formed from structural markup for the task of known-item search. As this task is very similar to work in meta-search and data fusion, we adapt several hypotheses ...
| Searching XML documents via XML fragments | ||
| David Carmel, Yoelle S. Maarek, Matan Mandelbrod, Yosi Mass, Aya Soffer | ||
| Pages: 151-158 | ||
| doi>10.1145/860435.860464 | ||
Most of the work on XML query and search has stemmed from the publishing and database communities, mostly for the needs of business applications. Recently, the Information Retrieval community began investigating the XML search issue to answer information ...
| SESSION: Text representation | ||
| Word sense disambiguation in information retrieval revisited | ||
| Christopher Stokoe, Michael P. Oakes, John Tait | ||
| Pages: 159-166 | ||
| doi>10.1145/860435.860466 | ||
Word sense ambiguity is recognized as having a detrimental effect on the precision of information retrieval systems in general and web search systems in particular, due to the sparse nature of the queries involved. Despite continued research into the ...
| Probabilistic term variant generator for biomedical terms | ||
| Yoshimasa Tsuruoka, Jun'ichi Tsujii | ||
| Pages: 167-173 | ||
| doi>10.1145/860435.860467 | ||
This paper presents an algorithm to generate possible variants for biomedical terms. The algorithm gives each variant its generation probability representing its plausibility, which is potentially useful for query and dictionary expansions. The probabilistic ...
| SESSION: Text categorization | ||
| A maximal figure-of-merit learning approach to text categorization | ||
| Sheng Gao, Wen Wu, Chin-Hui Lee, Tat-Seng Chua | ||
| Pages: 174-181 | ||
| doi>10.1145/860435.860469 | ||
A novel maximal figure-of-merit (MFoM) learning approach to text categorization is proposed. Different from the conventional techniques, the proposed MFoM method attempts to integrate any performance metric of interest (e.g. accuracy, recall, precision, ...
| Text categorization by boosting automatically extracted concepts | ||
| Lijuan Cai, Thomas Hofmann | ||
| Pages: 182-189 | ||
| doi>10.1145/860435.860470 | ||
Term-based representations of documents have found wide-spread use in information retrieval. However, one of the main shortcomings of such methods is that they largely disregard lexical semantics and, as a consequence, are not sufficiently robust with ...
| Robustness of regularized linear classification methods in text categorization | ||
| Jian Zhang, Yiming Yang | ||
| Pages: 190-197 | ||
| doi>10.1145/860435.860471 | ||
Real-world applications often require the classification of documents under situations of small number of features, mis-labeled documents and rare positive examples. This paper investigates the robustness of three regularized linear classification methods ...
| SESSION: Human interaction | ||
| Building and applying a concept hierarchy representation of a user profile | ||
| Nikolaos Nanas, Victoria Uren, Anne De Roeck | ||
| Pages: 198-204 | ||
| doi>10.1145/860435.860473 | ||
Term dependence is a natural consequence of language use. Its successful representation has been a long standing goal for Information Retrieval research. We present a methodology for the construction of a concept hierarchy that takes into account the ...
| Query length in interactive information retrieval | ||
| N. J. Belkin, D. Kelly, G. Kim, J.-Y. Kim, H.-J. Lee, G. Muresan, M.-C. Tang, X.-J. Yuan, C. Cool | ||
| Pages: 205-212 | ||
| doi>10.1145/860435.860474 | ||
Query length in best-match information retrieval (IR) systems is well known to be positively related to effectiveness in the IR task, when measured in experimental, non-interactive environments. However, in operational, interactive IR systems, query ...
| Re-examining the potential effectiveness of interactive query expansion | ||
| Ian Ruthven | ||
| Pages: 213-220 | ||
| doi>10.1145/860435.860475 | ||
Much attention has been paid to the relative effectiveness of interactive query expansion versus automatic query expansion. Although interactive query expansion has the potential to be an effective means of improving a search, in this paper we show that, ...
| SESSION: IR theory | ||
| Latent concepts and the number orthogonal factors in latent semantic analysis | ||
| Georges Dupret | ||
| Pages: 221-226 | ||
| doi>10.1145/860435.860477 | ||
We seek insight into Latent Semantic Indexing by establishing a method to identify the optimal number of factors in the reduced matrix for representing a keyword. This method is demonstrated empirically by duplicating all documents containing a term ...
| A frequency-based and a poisson-based definition of the probability of being informative | ||
| Thomas Roelleke | ||
| Pages: 227-234 | ||
| doi>10.1145/860435.860478 | ||
This paper reports on theoretical investigations about the assumptions underlying the inverse document frequency (idf). We show that an intuitive idf-based probability function for the probability of a term being informative assumes disjoint ...
| Table extraction using conditional random fields | ||
| David Pinto, Andrew McCallum, Xing Wei, W. Bruce Croft | ||
| Pages: 235-242 | ||
| doi>10.1145/860435.860479 | ||
The ability to find tables and extract information from them is a necessary component of data mining, question answering, and other information retrieval tasks. Documents often contain tables in order to communicate densely packed, multi-dimensional ...
| SESSION: Filtering and retrieval models | ||
| Building a filtering test collection for TREC 2002 | ||
| Ian Soboroff, Stephen Robertson | ||
| Pages: 243-250 | ||
| doi>10.1145/860435.860481 | ||
Test collections for the filtering track in TREC have typically used either past sets of relevance judgments, or categorized collections such as Reuters Corpus Volume 1 or OHSUMED, because filtering systems need relevance judgments during the experiment ...
| An empirical study on retrieval models for different document genres: patents and newspaper articles | ||
| Makoto Iwayama, Atsushi Fujii, Noriko Kando, Yuzo Marukawa | ||
| Pages: 251-258 | ||
| doi>10.1145/860435.860482 | ||
Reflecting the rapid growth in the utilization of large test collections for information retrieval since the 1990s, extensive comparative experiments have been performed to explore the effectiveness of various retrieval models. However, most collections ...
| Collaborative filtering via gaussian probabilistic latent semantic analysis | ||
| Thomas Hofmann | ||
| Pages: 259-266 | ||
| doi>10.1145/860435.860483 | ||
Collaborative filtering aims at learning predictive models of user preferences, interests or behavior from community data, i.e. a database of available user preferences. In this paper, we describe a new model-based algorithm designed for this task, which ...
| SESSION: Clustering | ||
| Document clustering based on non-negative matrix factorization | ||
| Wei Xu, Xin Liu, Yihong Gong | ||
| Pages: 267-273 | ||
| doi>10.1145/860435.860485 | ||
In this paper, we propose a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus. In the latent semantic space derived by the non-negative matrix factorization (NMF), each axis ...
| ReCoM: reinforcement clustering of multi-type interrelated data objects | ||
| Jidong Wang, Huajun Zeng, Zheng Chen, Hongjun Lu, Li Tao, Wei-Ying Ma | ||
| Pages: 274-281 | ||
| doi>10.1145/860435.860486 | ||
Most existing clustering algorithms cluster highly related data objects such as Web pages and Web users separately. The interrelation among different types of data objects is either not considered, or represented by a static feature space and treated ...
| A comparative study on content-based music genre classification | ||
| Tao Li, Mitsunori Ogihara, Qi Li | ||
| Pages: 282-289 | ||
| doi>10.1145/860435.860487 | ||
Content-based music genre classification is a fundamental component of music information retrieval systems and has been gaining importance and enjoying a growing amount of attention with the emergence of digital music on the Internet. Currently little ...
| SESSION: Distributed information retrieval | ||
| Evaluating different methods of estimating retrieval quality for resource selection | ||
| Henrik Nottelmann, Norbert Fuhr | ||
| Pages: 290-297 | ||
| doi>10.1145/860435.860489 | ||
In a federated digital library system, it is too expensive to query every accessible library. Resource selection is the task to decide to which libraries a query should be routed. Most existing resource selection algorithms compute a library ranking ...
| Relevant document distribution estimation method for resource selection | ||
| Luo Si, Jamie Callan | ||
| Pages: 298-305 | ||
| doi>10.1145/860435.860490 | ||
Prior research under a variety of conditions has shown the CORI algorithm to be one of the most effective resource selection algorithms, but the range of database sizes studied was not large. This paper shows that the CORI algorithm does not do well ...
| SETS: search enhanced by topic segmentation | ||
| Mayank Bawa, Gurmeet Singh Manku, Prabhakar Raghavan | ||
| Pages: 306-313 | ||
| doi>10.1145/860435.860491 | ||
We present SETS, an architecture for efficient search in peer-to-peer networks, building upon ideas drawn from machine learning and social network theory. The key idea is to arrange participating sites in a topic-segmented overlay ...
| SESSION: Novelty and topic change | ||
| Retrieval and novelty detection at the sentence level | ||
| James Allan, Courtney Wade, Alvaro Bolivar | ||
| Pages: 314-321 | ||
| doi>10.1145/860435.860493 | ||
Previous research in novelty detection has focused on the task of finding novel material, given a set or stream of documents on a certain topic. This study investigates the more difficult two-part task defined by the TREC 2002 novelty track: given a ...
| Domain-independent text segmentation using anisotropic diffusion and dynamic programming | ||
| Xiang Ji, Hongyuan Zha | ||
| Pages: 322-329 | ||
| doi>10.1145/860435.860494 | ||
This paper presents a novel domain-independent text segmentation method, which identifies the boundaries of topic changes in long text documents and/or text streams. The method consists of three components: As a preprocessing step, we eliminate the document-dependent ...
| A system for new event detection | ||
| Thorsten Brants, Francine Chen, Ayman Farahat | ||
| Pages: 330-337 | ||
| doi>10.1145/860435.860495 | ||
We present a new method and system for performing the New Event Detection task, i.e., in one or multiple streams of news stories, all stories on a previously unseen (new) event are marked. The method is based on an incremental TF-IDF model. Our extensions ...
| SESSION: Cross-lingual information retrieval | ||
| Probabilistic structured query methods | ||
| Kareem Darwish, Douglas W. Oard | ||
| Pages: 338-344 | ||
| doi>10.1145/860435.860497 | ||
Structured methods for query term replacement rely on separate estimates of term frequency and document frequency to compute a weight for each query term. This paper reviews prior work on structured ...
| Fuzzy translation of cross-lingual spelling variants | ||
| Ari Pirkola, Jarmo Toivonen, Heikki Keskustalo, Kari Visala, Kalervo Järvelin | ||
| Pages: 345-352 | ||
| doi>10.1145/860435.860498 | ||
We will present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first stage, transformation rules are applied to source words to render them more similar to their target language equivalents. The rules are generated ...
| Automatic transliteration for Japanese-to-English text retrieval | ||
| Yan Qu, Gregory Grefenstette, David A. Evans | ||
| Pages: 353-360 | ||
| doi>10.1145/860435.860499 | ||
For cross language information retrieval (CLIR) based on bilingual translation dictionaries, good performance depends upon lexical coverage in the dictionary. This is especially true for languages possessing few inter-language cognates, such as between ...
| POSTER SESSION: Posters | ||
| On the effectiveness of evaluating retrieval systems in the absence of relevance judgments | ||
| Javed A. Aslam, Robert Savell | ||
| Pages: 361-362 | ||
| doi>10.1145/860435.860501 | ||
Soboroff, Nicholas and Cahan recently proposed a method for evaluating the performance of retrieval systems without relevance judgments. They demonstrated that the system evaluations produced by their methodology are correlated with actual evaluations ...
| Resource selection and data fusion in multimedia distributed digital libraries | ||
| Jamie Callan, Fabio Crestani, Henrik Nottelmann, Pietro Pala, Xiao Mang Shou | ||
| Pages: 363-364 | ||
| doi>10.1145/860435.860502 | ||
| Transliteration of proper names in cross-language applications | ||
| Paola Virga, Sanjeev Khudanpur | ||
| Pages: 365-366 | ||
| doi>10.1145/860435.860503 | ||
| Toward a unification of text and link analysis | ||
| Brian D. Davison | ||
| Pages: 367-368 | ||
| doi>10.1145/860435.860504 | ||
This paper presents a simple yet profound idea. By thinking about the relationships between and within terms and documents, we can generate a richer representation that encompasses aspects of Web link analysis as well as text analysis techniques from ...
| Investigating the relationship between language model perplexity and IR precision-recall measures | ||
| Leif Azzopardi, Mark Girolami, Keith van Rijsbergen | ||
| Pages: 369-370 | ||
| doi>10.1145/860435.860505 | ||
An empirical study has been conducted investigating the relationship between the performance of an aspect based language model in terms of perplexity and the corresponding information retrieval performance obtained. It is observed, on the corpora considered, ...
| Topic distillation using hierarchy concept tree | ||
| Ikkyu Choi, Minkoo Kim | ||
| Pages: 371-372 | ||
| doi>10.1145/860435.860506 | ||
In this paper, we propose a new approach for topic distillation on World Wide Web. Topic distillation is to find quality documents related to the user query topic. Our approach is based on Bharat's topic distillation algorithm [1]. We present the analysis ...
| Using manually-built web directories for automatic evaluation of known-item retrieval | ||
| Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, David Grossman, Ophir Frieder | ||
| Pages: 373-374 | ||
| doi>10.1145/860435.860507 | ||
Information retrieval system evaluation is complicated by the need for manually assessed relevance judgments. Large manually-built directories on the web open the door to new evaluation procedures. By assuming that web pages are the known relevant items ...
| Popular music retrieval by detecting mood | ||
| Yazhong Feng, Yueting Zhuang, Yunhe Pan | ||
| Pages: 375-376 | ||
| doi>10.1145/860435.860508 | ||
| Exploiting query history for document ranking in interactive information retrieval | ||
| Xuehua Shen, Cheng Xiang Zhai | ||
| Pages: 377-378 | ||
| doi>10.1145/860435.860509 | ||
In this poster, we incorporate user query history, as context information, to improve the retrieval performance in interactive retrieval. Experiments using the TREC data show that incorporating such context information indeed consistently improves the ...
| Automatic ranking of retrieval systems in imperfect environments | ||
| Rabia Nuray, Fazli Can | ||
| Pages: 379-380 | ||
| doi>10.1145/860435.860510 | ||
The empirical investigation of the effectiveness of information retrieval (IR) systems requires a test collection, a set of query topics, and a set of relevance judgments made by human assessors for each query. Previous experiments show that differences ...
| An investigation of broad coverage automatic pronoun resolution for information retrieval | ||
| Richard J. Edens, Helen L. Gaylard, Gareth J. F. Jones, Adenike M. Lam-Adesina | ||
| Pages: 381-382 | ||
| doi>10.1145/860435.860511 | ||
Term weighting methods have been shown to give significant increases in information retrieval performance. The presence of pronominal references in documents reduces the term frequencies of associated words with a consequent effect on term weights and ...
| Syntactic features in question answering | ||
| Xiaoyan Li | ||
| Pages: 383-384 | ||
| doi>10.1145/860435.860512 | ||
Syntactic information potentially plays a much more important role in question answering than it does in information retrieval. Although many people have used syntactic evidence in Question Answering, there haven't been many detailed experiments reported ...
|
||
| Searchers' criteria for assessing web pages | ||
| Anastasios Tombros, Ian Ruthven, Joemon M. Jose | ||
| Pages: 385-386 | ||
| doi>10.1145/860435.860513 | ||
Full text: PDF
|
||
|
We investigate the criteria used by online searchers when assessing the relevance of web pages to information-seeking tasks. Twenty-four searchers were given three tasks each, and indicated the features of web pages which they employed when deciding ...
|
||
| When query expansion fails | ||
| Bodo Billerbeck, Justin Zobel | ||
| Pages: 387-388 | ||
| doi>10.1145/860435.860514 | ||
Full text: PDF
|
||
|
The effectiveness of queries in information retrieval can be improved through query expansion. This technique automatically introduces additional query terms that are statistically likely to match documents on the intended topic. However, query expansion ...
|
||
| Music modeling with random fields | ||
| Victor Lavrenko, Jeremy Pickens | ||
| Pages: 389-390 | ||
| doi>10.1145/860435.860515 | ||
Full text: PDF
|
||
| Fractal summarization: summarization based on fractal theory | ||
| Christopher C. Yang, Fu Lee Wang | ||
| Pages: 391-392 | ||
| doi>10.1145/860435.860516 | ||
Full text: PDF
|
||
|
In this paper, we introduce the fractal summarization model based on the fractal theory. In fractal summarization, the important information is captured from the source text by exploring the hierarchical structure and salient features of the document. ...
|
||
| A unified model for metasearch and the efficient evaluation of retrieval systems via the hedge algorithm | ||
| Javed A. Aslam, Virgiliu Pavlu, Robert Savell | ||
| Pages: 393-394 | ||
| doi>10.1145/860435.860517 | ||
Full text: PDF
|
||
|
We present a unified framework for simultaneously solving both the pooling problem (the construction of efficient document pools for the evaluation of retrieval systems) and metasearch (the fusion of ranked lists returned by retrieval systems in order ...
|
||
| Statistical visual feature indexes in video retrieval | ||
| Xiangming Mu, Gary Marchionini | ||
| Pages: 395-396 | ||
| doi>10.1145/860435.860518 | ||
Full text: PDF
|
||
|
Four statistical visual feature indexes are proposed: SLM (Shot Length Mean), the average length of each shot in a video; SLD (Shot Length Deviation), the standard deviation of shot lengths for a video; ONM (Object Number Mean), the average number of ...
|
||
| Enhancing cross-language information retrieval by an automatic acquisition of bilingual terminology from comparable corpora | ||
| Fatiha Sadat, Masatoshi Yoshikawa, Shunsuke Uemura | ||
| Pages: 397-398 | ||
| doi>10.1145/860435.860519 | ||
Full text: PDF
|
||
|
This paper presents an approach to bilingual lexicon extraction from comparable corpora and evaluations on Cross-Language Information Retrieval. We explore a bi-directional extraction of bilingual terminology primarily from comparable corpora. A combined ...
|
||
| Document-self expansion for text categorization | ||
| Yuen-Hsien Tseng, Da-Wei Juang | ||
| Pages: 399-400 | ||
| doi>10.1145/860435.860520 | ||
Full text: PDF
|
||
|
This work proposes approaches for increasing the number of training examples in the hope of improving classification effectiveness. The approaches were verified on two Chinese collections classified by two top-performing classifiers.
|
||
| An architecture for peer-to-peer information retrieval | ||
| Iraklis A. Klampanos, Joemon M. Jose | ||
| Pages: 401-402 | ||
| doi>10.1145/860435.860521 | ||
Full text: PDF
|
||
| User-trainable video annotation using multimodal cues | ||
| C-Y. Lin, M. Naphade, A. Natsev, C. Neti, J. R. Smith, B. Tseng, H. J. Nock, W. Adams | ||
| Pages: 403-404 | ||
| doi>10.1145/860435.860522 | ||
Full text: PDF
|
||
|
This paper describes progress towards a general framework for incorporating multimodal cues into a trainable system for automatically annotating user-defined semantic concepts in broadcast video. Models of arbitrary concepts are constructed by building ...
|
||
| Incorporating query term dependencies in language models for document retrieval | ||
| Munirathnam Srikanth, Rohini Srihari | ||
| Pages: 405-406 | ||
| doi>10.1145/860435.860523 | ||
Full text: PDF
|
||
| Error analysis of difficult TREC topics | ||
| Xiao Hu, Sindhura Bandhakavi, Chengxiang Zhai | ||
| Pages: 407-408 | ||
| doi>10.1145/860435.860524 | ||
Full text: PDF
|
||
|
Given the experimental nature of information retrieval, progress critically depends on analyzing the errors made by existing retrieval approaches and understanding their limitations. Our research explores various hypothesized reasons for hard topics ...
|
||
| XML retrieval: what to retrieve? | ||
| Jaap Kamps, Maarten Marx, Maarten de Rijke, Börkur Sigurbjörnsson | ||
| Pages: 409-410 | ||
| doi>10.1145/860435.860525 | ||
Full text: PDF
|
||
|
The fundamental difference between standard information retrieval and XML retrieval is the unit of retrieval. In traditional IR, the unit of retrieval is fixed: it is the complete document. In XML retrieval, every XML element in a document is a retrievable ...
|
||
| Discovering and structuring information flow among bioinformatics resources | ||
| Joan C. Bartlett, Elaine G. Toms | ||
| Pages: 411-412 | ||
| doi>10.1145/860435.860526 | ||
Full text: PDF
|
||
|
In this poster, we present a model of the flow of information among bioinformatics resources in the context of a specific scientific problem. Combining task analysis with traditional, qualitative research, we determined the extent to which the bioinformatics ...
|
||
| eBizSearch: a niche search engine for e-business | ||
| C. Lee Giles, Yves Petinot, Pradeep B. Teregowda, Hui Han, Steve Lawrence, Arvind Rangaswamy, Nirmal Pal | ||
| Pages: 413-414 | ||
| doi>10.1145/860435.860527 | ||
Full text: PDF
|
||
|
Niche Search Engines offer an efficient alternative to traditional search engines when the results returned by general-purpose search engines do not provide a sufficient degree of relevance. By taking advantage of their domain of concentration they achieve ...
|
||
| Single n-gram stemming | ||
| James Mayfield, Paul McNamee | ||
| Pages: 415-416 | ||
| doi>10.1145/860435.860528 | ||
Full text: PDF
|
||
|
Stemming can improve retrieval accuracy, but stemmers are language-specific. Character n-gram tokenization achieves many of the benefits of stemming in a language independent way, but its use incurs a performance penalty. We demonstrate that selection ...
|
||
| Average gain ratio: a simple retrieval performance measure for evaluation with multiple relevance levels | ||
| Tetsuya Sakai | ||
| Pages: 417-418 | ||
| doi>10.1145/860435.860529 | ||
Full text: PDF
|
||
| A comparison of various approaches for using probabilistic dependencies in language modeling | ||
| Peter Bruza, Dawei Song | ||
| Pages: 419-420 | ||
| doi>10.1145/860435.860530 | ||
Full text: PDF
|
||
| Topic hierarchy generation via linear discriminant projection | ||
| Tao Li, Shenghuo Zhu, Mitsunori Ogihara | ||
| Pages: 421-422 | ||
| doi>10.1145/860435.860531 | ||
Full text: PDF
|
||
| A personalised information retrieval tool | ||
| Innes Martin, Joemon M. Jose | ||
| Pages: 423-424 | ||
| doi>10.1145/860435.860532 | ||
Full text: PDF
|
||
|
Industry professionals and everyday users of the Internet have long accepted that due to both the size and growth of this ubiquitous repository, new tools are needed to assist with the finding and extraction of very specific resources relevant to a user's ...
|
||
| Classification of source code archives | ||
| Robert Krovetz, Secil Ugurel, C. Lee Giles | ||
| Pages: 425-426 | ||
| doi>10.1145/860435.860533 | ||
Full text: PDF
|
||
|
The World Wide Web contains a number of source code archives. Programs are usually classified into various categories within the archive by hand. We report on experiments for automatic classification of source code into these categories. We examined ...
|
||
| Passage retrieval vs. document retrieval for factoid question answering | ||
| Charles L. A. Clarke, Egidio L. Terra | ||
| Pages: 427-428 | ||
| doi>10.1145/860435.860534 | ||
Full text: PDF
|
||
| Evaluating retrieval performance for Japanese question answering: what are best passages? | ||
| Tetsuya Sakai, Tomoharu Kokubu | ||
| Pages: 429-430 | ||
| doi>10.1145/860435.860535 | ||
Full text: PDF
|
||
| Image classification using hybrid neural networks | ||
| Chih-Fong Tsai, Ken McGarry, John Tait | ||
| Pages: 431-432 | ||
| doi>10.1145/860435.860536 | ||
Full text: PDF
|
||
|
Use of semantic content is one of the major issues which needs to be addressed for improving image retrieval effectiveness. We present a new approach to classify images based on the combination of image processing techniques and hybrid neural networks. ...
|
||
| On an equivalence between PLSI and LDA | ||
| Mark Girolami, Ata Kabán | ||
| Pages: 433-434 | ||
| doi>10.1145/860435.860537 | ||
Full text: PDF
|
||
|
Latent Dirichlet Allocation (LDA) is a fully generative approach to language modelling which overcomes the inconsistent generative semantics of Probabilistic Latent Semantic Indexing (PLSI). This paper shows that PLSI is a maximum a posteriori ...
|
||
| Query word deletion prediction | ||
| Rosie Jones, Daniel C. Fain | ||
| Pages: 435-436 | ||
| doi>10.1145/860435.860538 | ||
Full text: PDF
|
||
|
Web search query logs contain traces of users' search modifications. One strategy users employ is deleting terms, presumably to obtain greater coverage. It is useful to model and automate term deletion when arbitrary searches are conjunctively matched ...
|
||
| Assessing the effectiveness of pen-based input queries | ||
| Stephen Levin, Paul Clough, Mark Sanderson | ||
| Pages: 437-438 | ||
| doi>10.1145/860435.860539 | ||
Full text: PDF
|
||
|
In this poster, we describe an experiment exploring the effectiveness of a pen-based text input device for use in query construction. Standard TREC queries were written, recognised, and subsequently retrieved upon. Comparisons between retrieval effectiveness ...
|
||
| A light weight PDA-friendly collection fusion technique | ||
| Jeffery Antoniuk, Mario A. Nascimento | ||
| Pages: 439-440 | ||
| doi>10.1145/860435.860540 | ||
Full text: PDF
|
||
|
This short paper presents a light weight technique to merge results lists obtained from querying different databases. The motivation for such a technique is a general purpose search engine for Palm-OS based PDAs.
|
||
| Speech-based and video-supported indexing of multimedia broadcast news | ||
| Yoshihiko Hayashi, Katsutoshi Ohtsuki, Katsuji Bessho, Osamu Mizuno, Yoshihiro Matsuo, Shoichi Matsunaga, Minoru Hayashi, Takaaki Hasegawa, Naruhiro Ikeda | ||
| Pages: 441-442 | ||
| doi>10.1145/860435.860541 | ||
Full text: PDF
|
||
|
This paper describes an automatic content indexing system for news programs, with a special emphasis on its segmentation process. The process can successfully segment an entire news program into topic-centered news stories; the primary tool is a linguistic ...
|
||
| Summary evaluation and text categorization | ||
| Khurshid Ahmad, Bogdan Vrusias, Paulo C F de Oliveira | ||
| Pages: 443-444 | ||
| doi>10.1145/860435.860542 | ||
Full text: PDF
|
||
|
In general terms the evaluation of a summary depends on how close it is to the chief points in the source text. This begets the question as to what are the chief points in the source text and how is this information used in itself in identifying the ...
|
||
| Rule-based word clustering for text classification | ||
| Hui Han, Eren Manavoglu, C. Lee Giles, Hongyuan Zha | ||
| Pages: 445-446 | ||
| doi>10.1145/860435.860543 | ||
Full text: PDF
|
||
|
This paper introduces a rule-based, context-dependent word clustering method, with the rules derived from various domain databases and the word text orthographic properties. Besides significant dimensionality reduction, our experiments show that such ...
|
||
| HAT: a hardware assisted TOP-DOC inverted index component | ||
| S. Kagan Agun, Ophir Frieder | ||
| Pages: 447-448 | ||
| doi>10.1145/860435.860544 | ||
Full text: PDF
|
||
|
A novel Hardware Assisted Top-Doc (HAT) component is disclosed. HAT is an optimized content indexing device based on a modified inverted index structure. HAT accommodates patterns of different lengths and supports a varied posting list versus term count ...
|
||
| An information-theoretic measure for document similarity | ||
| Javed A. Aslam, Meredith Frost | ||
| Pages: 449-450 | ||
| doi>10.1145/860435.860545 | ||
Full text: PDF
|
||
|
Recent work has demonstrated that the assessment of pairwise object similarity can be approached in an axiomatic manner using information theory. We extend this concept specifically to document similarity and test the effectiveness of an information-theoretic ...
|
||
| Optimizing term vectors for efficient and robust filtering | ||
| David A. Evans, Jeffrey Bennett, David A. Hull | ||
| Pages: 451-452 | ||
| doi>10.1145/860435.860546 | ||
Full text: PDF
|
||
|
We describe an efficient, robust method for selecting and optimizing terms for a classification or filtering task. Terms are extracted from positive examples in training data based on several alternative term-selection algorithms, then combined additively ...
|
||
| The TREC-like evaluation of music IR systems | ||
| J. Stephen Downie | ||
| Pages: 453-454 | ||
| doi>10.1145/860435.860547 | ||
Full text: PDF
|
||
|
This poster reports upon the ongoing efforts being made to establish TREC-like and other comprehensive evaluation paradigms within the Music IR (MIR) and Music Digital Library (MDL) research communities. The proposed research tasks are based upon expert ...
|
||
| Stemming in the language modeling framework | ||
| James Allan, Giridhar Kumaran | ||
| Pages: 455-456 | ||
| doi>10.1145/860435.860548 | ||
Full text: PDF
|
||
| Generating hierarchical summaries for web searches | ||
| Dawn J. Lawrie, W. Bruce Croft | ||
| Pages: 457-458 | ||
| doi>10.1145/860435.860549 | ||
Full text: PDF
|
||
|
Hierarchies provide a means of organizing, summarizing and accessing information. We describe a method for automatically generating hierarchies from small collections of text, and then apply this technique to summarizing the documents retrieved by a ...
|
||
| Analysis of anchor text for web search | ||
| Nadav Eiron, Kevin S. McCurley | ||
| Pages: 459-460 | ||
| doi>10.1145/860435.860550 | ||
Full text: PDF
|
||
| DEMONSTRATION SESSION: Demos | ||
| User-assisted query translation for interactive CLIR | ||
| Daqing He, Jianqiang Wang, Douglas W. Oard, Michael Nossal | ||
| Pages: 461-461 | ||
| doi>10.1145/860435.860552 | ||
Full text: PDF
|
||
| DefScriber: a hybrid system for definitional QA | ||
| Sasha Blair-Goldensohn, Kathleen R. McKeown, Andrew Hazen Schlaikjer | ||
| Pages: 462-462 | ||
| doi>10.1145/860435.860553 | ||
Full text: PDF
|
||
| Querying XML using structures and keywords in timber | ||
| Cong Yu, H. V. Jagadish, Dragomir R. Radev | ||
| Pages: 463-463 | ||
| doi>10.1145/860435.860554 | ||
Full text: PDF
|
||
|
This demonstration will describe how Timber, a native XML database system, has been extended with the capability to answer XML-style structured queries (e.g., XQuery) with embedded IR-style keyword-based non-boolean conditions. With the original structured ...
|
||
| SE-LEGO: creating metasearch engines on demand | ||
| Zonghuan Wu, Vijay Raghavan, Chun Du, Komanduru Sai C, Weiyi Meng, Hai He, Clement Yu | ||
| Pages: 464-464 | ||
| doi>10.1145/860435.860555 | ||
Full text: PDF
|
||
| MIND: resource selection and data fusion in multimedia distributed digital libraries | ||
| Stefano Berretti, Jamie Callan, Henrik Nottelmann, Xiao Mang Shou, Shengli Wu | ||
| Pages: 465-465 | ||
| doi>10.1145/860435.860556 | ||
Full text: PDF
|
||
| Head/modifier pairs for everyone | ||
| Cornelis H. A. Koster | ||
| Pages: 466-466 | ||
| doi>10.1145/860435.860557 | ||
Full text: PDF
|
||
| Document retrieval from user-selected web sites | ||
| Ulrich Bohnacker, Ingrid Renz | ||
| Pages: 467-467 | ||
| doi>10.1145/860435.860558 | ||
Full text: PDF
|
||
|
We present a new tool for gathering textual information according to a query (texts) on arbitrary web sites specified by an information-seeking user. This tool is helpful in any knowledge-intensive area. Its technology is based on the vector space model ...
|
||
| eArchivarius: accessing collections of electronic mail | ||
| Anton Leuski, Douglas W. Oard, Rahul Bhagat | ||
| Pages: 468-468 | ||
| doi>10.1145/860435.860559 | ||
Full text: PDF
|
||
|
We present eArchivarius, an interactive system for accessing collections of electronic mail. The system combines search, clustering visualization, and time-based visualization of email messages and the people who sent or received the messages.
|
||