Salton Award lecture: on theoretical argument in information retrieval (summary only)
Stephen Robertson
Page: 1
doi: 10.1145/345508.344658
The last winner of the Salton Award, Tefko Saracevic, gave an acceptance address at SIGIR in Philadelphia in 1997. Previous winners were William Cooper (1994), Cyril Cleverdon (1991), Karen Sparck Jones (1988) and Gerard Salton himself (1985).
In this talk, I plan to follow the tradition of acceptance addresses, and present a personal view of and retrospective on some of the areas in which I work. However, I will not be saying much about what are perhaps the two most obvious parts of my work: the probabilistic approach to retrieval and evaluation of retrieval systems. Rather I will attempt to get under the skin of my take on IR, by discussing the nature of theoretical argument in the field, partly through examples. This talk is about the place of theory in the study of information retrieval (in some sense following Bill Cooper's 1994 topic), but not so much Theory with a capital T — rather what might be described as small-t theory.
The field has a very strong pragmatic orientation, reflected both in the attitudes of the commercial participants and in the emphasis on formal evaluation in the academic environment. Nevertheless, there are many theoretical ideas buried in, or implied by, the ways we talk about the field — the language we use to discuss it. I will be discussing two areas to illustrate these low-level theoretical ideas: precision devices, and the apparent symmetry between retrieval and filtering.
The phrase 'precision device' used to have a rather clear meaning in IR, in the days of set-based retrieval systems. In that context, a precision device was a device to enable the restriction of the retrieved set to those most likely (out of the documents originally included) to be useful. These days, with the ubiquitous scoring and ranking methods largely replacing set-based retrieval, the idea has lost its meaning. It is worth exploring the formal relationships involved to understand the change a little better.
My second area is to do with the relation between filtering and the more traditional type of ad hoc information retrieval. There is a tendency and a temptation to see these as the same kind of thing, sometimes with a more specific assumption of duality, based on the inversion of the roles of documents and queries. It is important to see how far this parallel extends, and where it breaks down. I explore the nature of the duality and the kinds of reasons why it does break down.
These examples reflect my interest in the basic logical structure of information retrieval systems and the situations in which such systems may be found. I argue for a certain level of logical argument in information retrieval, which might be taken as small-t theory, though not as capital-T Theory. I believe there are reasons to think a Grand Theory of IR to be an unattainable goal — such a theory would have to encompass so many different aspects of retrieval, having to do for example with human cognition and behaviour and the structure of knowledge, as well as with the statistical concepts that inform the probabilistic approach.
However, accepting the unattainability of a Grand Theory does not preclude the development of further and more useful models based on particular aspects and lower-level logic. The low-level logic is important not only in its own right, but as the basis for linking together more sophisticated theories concerned with more restricted domains. The most elaborate and complete theory of (say) user behaviour is of no use at all without a strong linkage between the parts of that theory and the entities relevant to IR that fall outside its scope. The glue that provides that linkage has to be low-level logic.

Relevance and contributing information types of searched documents in task performance
Pertti Vakkari
Pages: 2-9
doi: 10.1145/345508.345512
End-users base the relevance judgements of the searched documents on the expected contribution to their task of the information contained in the documents. There is a shortage of studies analyzing the relationships between the experienced contribution, relevance assessments and type of information initially sought. This study categorizes the types of information in documents being used in writing a research proposal for a master's thesis by eleven students throughout the various stages of the proposal writing process. The role of the specificity of the searched information in influencing its contribution is analyzed. The results demonstrate that different types of information are sought at different stages of the writing process and thus the contribution of the information also differs at the different stages. The categories of the contributing information can be understood in terms of topicality.

Relevance feedback with a small number of relevance judgements: incremental relevance feedback vs. document clustering
Makoto Iwayama
Pages: 10-16
doi: 10.1145/345508.345538
The use of incremental relevance feedback and document clustering was investigated in a relevance feedback environment in which the number of relevance judgements was quite small. Through experiments on the TREC collection, the incremental relevance feedback approach was found not to improve the overall search effectiveness. The clustering approach was found to be promising, although it sometimes over-focuses on a particular topic in a query and ignores the others. To overcome this problem, a query-biased clustering algorithm was developed and shown to be effective.

Do batch and user evaluations give the same results?
William Hersh,
Andrew Turpin,
Susan Price,
Benjamin Chan,
Dale Kramer,
Lynetta Sacherek,
Daniel Olson
Pages: 17-24
doi: 10.1145/345508.345539
Do improvements in system performance demonstrated by batch evaluations confer the same benefit for real users? We carried out experiments designed to investigate this question. After identifying a weighting scheme that gave maximum improvement over the baseline in a non-interactive evaluation, we used it with real users searching on an instance recall task. Our results showed that the weighting scheme that gave beneficial results in batch studies did not do so with real users. Further analysis did identify other factors predictive of instance recall, including number of documents saved by the user, document recall, and number of documents seen by the user.

A novel method for the evaluation of Boolean query effectiveness across a wide operational range
Eero Sormunen
Pages: 25-32
doi: 10.1145/345508.345541
Traditional methods for the system-oriented evaluation of Boolean IR systems suffer from validity and reliability problems. Laboratory-based research neglects the searcher and studies suboptimal queries. Research on operational systems fails to make a distinction between searcher performance and system performance. Neither approach is capable of measuring performance at standard points of operation (e.g. across R0.0-R1.0).
A new laboratory-based evaluation method for Boolean IR systems is proposed. It is based on a controlled formulation of inclusive query plans, on an automatic conversion of query plans into elementary queries, and on combining elementary queries into optimal queries at standard points of operation. Major results of a large case experiment are reported. The validity, reliability, and efficiency of the method are considered in the light of empirical and analytical test data.

Evaluating evaluation measure stability
Chris Buckley,
Ellen M. Voorhees
Pages: 33-40
doi: 10.1145/345508.345543
This paper presents a novel way of examining the accuracy of the evaluation measures commonly used in information retrieval experiments. It validates several of the rules-of-thumb experimenters use, such as the number of queries needed for a good experiment is at least 25 and 50 is better, while challenging other beliefs, such as the common evaluation measures are equally reliable. As an example, we show that Precision at 30 documents has about twice the average error rate as Average Precision has. These results can help information retrieval researchers design experiments that provide a desired level of confidence in their results. In particular, we suggest researchers using Web measures such as Precision at 10 documents will need to use many more than 50 queries or will have to require two methods to have a very large difference in evaluation scores before concluding that the two methods are actually different.

IR evaluation methods for retrieving highly relevant documents
Kalervo Järvelin,
Jaana Kekäläinen
Pages: 41-48
doi: 10.1145/345508.345545
This paper proposes evaluation methods based on the use of non-dichotomous relevance judgements in IR experiments. It is argued that evaluation methods should credit IR methods for their ability to retrieve highly relevant documents. This is desirable from the user point of view in modern large IR environments. The proposed methods are (1) a novel application of P-R curves and average precision computations based on separate recall bases for documents of different degrees of relevance, and (2) two novel measures computing the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. We then demonstrate the use of these evaluation methods in a case study on the effectiveness of query types, based on combinations of query structures and expansion, in retrieving documents of various degrees of relevance. The test was run with a best match retrieval system (InQuery) in a text database consisting of newspaper articles. The results indicate that the tested strong query structures are most effective in retrieving highly relevant documents. The differences between the query types are practically essential and statistically significant. More generally, the novel evaluation methods and the case demonstrate that non-dichotomous relevance assessments are applicable in IR experiments, may reveal interesting phenomena, and allow harder testing of IR methods.
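As a rough illustration of the cumulated-gain idea (not necessarily the paper's exact parameterization), the sketch below assumes graded relevance scores such as 0-3 for the documents at each ranked position and a base-2 logarithmic discount for the second measure:

    import math

    def cumulated_gain(gains):
        """Running sum of graded relevance scores down the ranking."""
        out, total = [], 0.0
        for g in gains:
            total += g
            out.append(total)
        return out

    def discounted_cumulated_gain(gains, base=2.0):
        """As above, but the gain at rank i (1-based) is divided by log_base(i)
        once i reaches the base, so late-ranked documents contribute less."""
        out, total = [], 0.0
        for i, g in enumerate(gains, start=1):
            total += g if i < base else g / math.log(i, base)
            out.append(total)
        return out

    # Graded relevance of the top 10 ranked documents (0 = not relevant, 3 = highly relevant).
    ranked_gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
    print(cumulated_gain(ranked_gains))
    print(discounted_cumulated_gain(ranked_gains))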

Automatic generation of overview timelines
Russell Swan,
James Allan
Pages: 49-56
doi: 10.1145/345508.345546
We present a statistical model of feature occurrence over time, and develop tests based on classical hypothesis testing for significance of term appearance on a given date. Using additional classical hypothesis testing we are able to combine these terms to generate “topics” as defined by the Topic Detection and Tracking study. The groupings of terms obtained can be used to automatically generate an interactive timeline displaying the major events and topics covered by the corpus. To test the validity of our technique we extracted a large number of these topics from a test corpus and had human evaluators judge how well the selected features captured the gist of the topics, and how they overlapped with a set of known topics from the corpus. The resulting topics were highly rated by evaluators who compared them to known topics.
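The abstract does not give the test statistic; one classical hypothesis test that fits the description is a chi-square test on the 2x2 contingency table of term presence versus date, sketched below under that assumption with made-up counts:

    def chi_square_term_on_date(term_on_date, docs_on_date, term_total, docs_total):
        """Chi-square statistic for the 2x2 table (term present/absent) x (on date/other dates).
        Large values suggest the term is unusually frequent on that date."""
        a = term_on_date                          # term present, on the date
        b = term_total - term_on_date             # term present, other dates
        c = docs_on_date - term_on_date           # term absent, on the date
        d = (docs_total - docs_on_date) - b       # term absent, other dates
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

    # Hypothetical counts: the term appears in 40 of 120 stories on one day,
    # and in 90 of 30000 stories over the whole collection.
    print(chi_square_term_on_date(40, 120, 90, 30000))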

Event tracking based on domain dependency
Fumiyo Fukumoto,
Yoshimi Suzuki
Pages: 57-64
doi: 10.1145/345508.345548
This paper proposes a method for event tracking on broadcast news stories based on the distinction between a topic and an event. A topic and an event are identified using a simple criterion called domain dependency of words: how greatly a word features a given set of data. The method was tested on the TDT corpus, which has been developed by the TDT Pilot Study, and the results can be regarded as promising, suggesting the usefulness of the method.

Improving text categorization methods for event tracking
Yiming Yang,
Tom Ault,
Thomas Pierce,
Charles W. Lattimer
Pages: 65-72
doi: 10.1145/345508.345550
Automated tracking of events from chronologically ordered document streams is a new challenge for statistical text classification. Existing learning techniques must be adapted or improved in order to effectively handle difficult situations where the number of positive training instances per event is extremely small, the majority of training documents are unlabelled, and most of the events have a short duration in time. We adapted several supervised text categorization methods, specifically several new variants of the k-Nearest Neighbor (kNN) algorithm and a Rocchio approach, to track events. All of these methods showed significant improvement (up to 71% reduction in weighted error rates) over the performance of the original kNN algorithm on TDT benchmark collections, making kNN among the top-performing systems in the recent TDT3 official evaluation. Furthermore, by combining these methods, we significantly reduced the variance in performance of our event tracking system over different data collections, suggesting a robust solution for parameter optimization.

Evaluation of a simple and effective music information retrieval method
Stephen Downie,
Michael Nelson
Pages: 73-80
doi: 10.1145/345508.345551
We developed, and then evaluated, a music information retrieval (MIR) system based upon the intervals found within the melodies of a collection of 9354 folksongs. The songs were converted to an interval-only representation of monophonic melodies and then fragmented into length-n subsections called n-grams. The length of these n-grams and the degree to which we precisely represent the intervals are variables analyzed in this paper. We constructed a collection of “musical word” databases using the text-based SMART information retrieval system. A group of simulated queries, some of which contained simulated errors, was run against these databases. The results were evaluated using the normalized precision and normalized recall measures. Our concept of “musical words” shows great merit, thus implying that useful MIR systems can be constructed simply and efficiently using pre-existing text-based information retrieval software. Second, this study is a formal and comprehensive evaluation of a MIR system using rigorous statistical analyses to determine retrieval effectiveness.
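The abstract only sketches the representation; as an assumed reading of it, the following turns a monophonic melody (MIDI pitch numbers) into interval-only n-grams that an ordinary text engine such as SMART could index as "musical words":

    def intervals(pitches):
        """Interval-only representation: differences between consecutive pitches."""
        return [b - a for a, b in zip(pitches, pitches[1:])]

    def musical_words(pitches, n=3):
        """Slide a window of n intervals over the melody and render each window
        as one token, e.g. '+2_+2_-4', suitable for a text search engine."""
        ivs = intervals(pitches)
        return ["_".join(f"{i:+d}" for i in ivs[k:k + n])
                for k in range(len(ivs) - n + 1)]

    # Opening of "Frere Jacques" as MIDI pitches (C4 D4 E4 C4 C4 D4 E4 C4).
    print(musical_words([60, 62, 64, 60, 60, 62, 64, 60], n=3))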

Phonetic confusion matrix based spoken document retrieval
Savitha Srinivasan,
Dragutin Petkovic
Pages: 81-87
doi: 10.1145/345508.345552
Combined word-based and phonetic indexes have been used to improve the performance of spoken document retrieval systems primarily by addressing the out-of-vocabulary retrieval problem. However, a known problem with phonetic recognition is its limited accuracy in comparison with word level recognition. We propose a novel method for phonetic retrieval in the CueVideo system based on the probabilistic formulation of term weighting using phone confusion data in a Bayesian framework. We evaluate this method of spoken document retrieval against word-based retrieval for the search levels identified in a realistic video-based distributed learning setting. Using our test data, we achieved an average recall of 0.88 with an average precision of 0.69 for retrieval of out-of-vocabulary words on phonetic transcripts with 35% word error rate. For in-vocabulary words, we achieved a 17% improvement in recall over word-based retrieval with a 17% loss in precision for word error rates ranging from 35 to 65%.

Multiple evidence combination in image retrieval: Diogenes searches for people on the Web
Y. Alp Aslandogan,
Clement T. Yu
Pages: 88-95
doi: 10.1145/345508.345553
In this work, we examine evidence combination mechanisms for classifying multimedia information. In particular, we examine linear and Dempster-Shafer methods of evidence combination in the context of identifying personal images on the World Wide Web. An automatic web search engine named Diogenes searches the web for personal images and combines different pieces of evidence for identification. The sources of evidence consist of input from face detection/recognition and text/HTML analysis modules. A degree of uncertainty is involved with both of these sources. Diogenes automatically determines the uncertainty locally for each retrieval and uses this information to set a relative significance for each piece of evidence. To our knowledge, Diogenes is the first image search engine using Dempster-Shafer evidence combination based on automatic object recognition and dynamic local uncertainty assessment. In our experiments Diogenes comfortably outperformed some well known commercial and research prototype image search engines for celebrity image queries.

Link-based and content-based evidential information in a belief network model
Ilmério Silva,
Berthier Ribeiro-Neto,
Pável Calado,
Edleno Moura,
Nívio Ziviani
Pages: 96-103
doi: 10.1145/345508.345554
This work presents an information retrieval model developed to deal with hyperlinked environments. The model is based on belief networks and provides a framework for combining information extracted from the content of the documents with information derived from cross-references among the documents. The information extracted from the content of the documents is based on statistics regarding the keywords in the collection and is one of the bases for traditional information retrieval (IR) ranking algorithms. The information derived from cross-references among the documents is based on link references in a hyperlinked environment and has received increased attention lately due to the success of the Web. We discuss a set of strategies for combining these two types of sources of evidential information and experiment with them using a reference collection extracted from the Web. The results show that this type of combination can improve the retrieval performance without requiring any extra information from the users at query time. In our experiments, the improvements reach up to 59% in terms of average precision figures.

The feature quantity: an information theoretic perspective of Tfidf-like measures
Akiko Aizawa
Pages: 104-111
doi: 10.1145/345508.345556
The feature quantity, a quantitative representation of specificity introduced in this paper, is based on an information theoretic perspective of co-occurrence events between terms and documents. Mathematically, the feature quantity is defined as a product of probability and information, and maintains a good correspondence with the tfidf-like measures popularly used in today's IR systems. In this paper, we present a formal description of the feature quantity, as well as some illustrative examples of applying such a quantity to different types of information retrieval tasks: representative term selection and text categorization.
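As a rough reading of "a product of probability and information" and its kinship with tf-idf (the paper's exact definition may differ), the sketch below scores each term-document co-occurrence by the probability of the co-occurrence times the self-information of the term:

    import math
    from collections import Counter

    def feature_quantities(docs):
        """Return, for each (term, document) pair, P(t, d) * -log P(t):
        the probability of the co-occurrence times the information carried by the term."""
        pair_counts, term_counts, total = Counter(), Counter(), 0
        for doc_id, text in docs.items():
            for term in text.split():
                pair_counts[(term, doc_id)] += 1
                term_counts[term] += 1
                total += 1
        return {(t, d): (c / total) * -math.log(term_counts[t] / total)
                for (t, d), c in pair_counts.items()}

    docs = {"d1": "retrieval of text text", "d2": "retrieval of images"}
    for pair, value in sorted(feature_quantities(docs).items(), key=lambda kv: -kv[1]):
        print(pair, round(value, 4))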

INSYDER — an information assistant for business intelligence
Harald Reiterer,
Gabriela Mußler,
Thomas M. Mann,
Siegfried Handschuh
Pages: 112-119
doi: 10.1145/345508.345559
The WWW is the most important resource for external business information. This paper presents a tool called INSYDER, an information assistant for finding and analysing business information from the WWW. INSYDER is a system using different agents for crawling the Web, evaluating and visualising the results. These agents, the visualisations used, and a first summary of user studies are presented.

Structured translation for cross-language information retrieval
Ruth Sperer,
Douglas W. Oard
Pages: 120-127
doi: 10.1145/345508.345562
The paper introduces a query translation model that reflects the structure of the cross-language information retrieval task. The model is based on a structured bilingual dictionary in which the translations of each term are clustered into groups with distinct meanings. Query translation is modeled as a two-stage process, with the system first determining the intended meaning of a query term and then selecting translations appropriate to that meaning that might appear in the document collection. An implementation of structured translation based on automatic dictionary clustering is described and evaluated by using Chinese queries to retrieve English documents. Structured translation achieved an average precision that was statistically indistinguishable from Pirkola's technique for very short queries, but Pirkola's technique outperformed structured translation on long queries. The paper concludes with some observations on future work to improve retrieval effectiveness and on other potential uses of structured translation in interactive cross-language retrieval applications.

Automatic adaptation of proper noun dictionaries through cooperation of machine learning and probabilistic methods
Georgios Petasis,
Alessandro Cucchiarelli,
Paola Velardi,
Georgios Paliouras,
Vangelis Karkaletsis,
Constantine D. Spyropoulos
Pages: 128-135
doi: 10.1145/345508.345563
The recognition of Proper Nouns (PNs) is considered an important task in the area of Information Retrieval and Extraction. However, the high performance of most existing PN classifiers heavily depends upon the availability of large dictionaries of domain-specific Proper Nouns, and a certain amount of manual work for rule writing or manual tagging. Though it is not a heavy requirement to rely on some existing PN dictionary (often these resources are available on the web), its coverage of a domain corpus may be rather low, in the absence of manual updating. In this paper we propose a technique for the automatic updating of a PN Dictionary through the cooperation of an inductive and a probabilistic classifier. In our experiments we show that, whenever an existing PN Dictionary allows the identification of 50% of the proper nouns within a corpus, our technique allows, without additional manual effort, the successful recognition of about 90% of the remaining 50%.

Document centered approach to text normalization
Andrei Mikheev
Pages: 136-143
doi: 10.1145/345508.345564
In this paper we present an approach to tackle three important problems of text normalization: sentence boundary disambiguation, disambiguation of capitalized words when they are used in positions where capitalization is expected, and identification of abbreviations. The main feature of our approach is that it uses a minimum of pre-built resources, instead dynamically inferring disambiguation clues from the entire document itself. This makes it domain independent, closely targeted to each individual document and portable to other languages. We thoroughly evaluated this approach on several corpora and it showed high accuracy.

OCELOT: a system for summarizing Web pages
Adam L. Berger,
Vibhu O. Mittal
Pages: 144-151
doi: 10.1145/345508.345565
We introduce OCELOT, a prototype system for automatically generating the “gist” of a web page by summarizing it. Although most text summarization research to date has focused on summarizing news articles, web pages are quite different in both structure and content. Instead of coherent text with a well-defined discourse structure, they are more often a chaotic jumble of phrases, links, graphics and formatting commands. Such text provides little foothold for extractive summarization techniques, which attempt to generate a summary of a document by excerpting a contiguous, coherent span of text from it. This paper builds upon recent work in non-extractive summarization, producing the gist of a web page by “translating” it into a more concise representation rather than attempting to extract a text span verbatim. OCELOT uses probabilistic models to guide it in selecting and ordering words into a gist. This paper describes a technique for learning these models automatically from a collection of human-summarized web pages.

Extracting sentence segments for text summarization: a machine learning approach
Wesley T. Chuang,
Jihoon Yang
Pages: 152-159
doi: 10.1145/345508.345566
With the proliferation of the Internet and the huge amount of data it transfers, text summarization is becoming more important. We present an approach to the design of an automatic text summarizer that generates a summary by extracting sentence segments. First, sentences are broken into segments by special cue markers. Each segment is represented by a set of predefined features (e.g. location of the segment, average term frequencies of the words occurring in the segment, number of title words in the segment, and the like). Then a supervised learning algorithm is used to train the summarizer to extract important sentence segments, based on the feature vector. Results of experiments on U.S. patents indicate that the performance of the proposed approach compares very favorably with other approaches (including Microsoft Word summarizer) in terms of precision, recall, and classification accuracy.

An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages
Ion Androutsopoulos,
John Koutsias,
Konstantinos V. Chandrinos,
Constantine D. Spyropoulos
Pages: 160-167
doi: 10.1145/345508.345569
The growing problem of unsolicited bulk e-mail, also known as “spam”, has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in “encrypted” form contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures, investigating at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns, and which is part of a widely used e-mail reader.
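For readers unfamiliar with the kind of classifier being evaluated, here is a generic multinomial Naive Bayes spam filter with a simple cost-sensitive decision threshold; it is a sketch of the technique in general, not the authors' attribute selection or cost-sensitive measures:

    import math
    from collections import Counter

    class NaiveBayesSpamFilter:
        def __init__(self, threshold=0.5):
            self.threshold = threshold          # raise it to make false positives costlier
            self.word_counts = {"spam": Counter(), "ham": Counter()}
            self.doc_counts = {"spam": 0, "ham": 0}

        def train(self, text, label):
            self.doc_counts[label] += 1
            self.word_counts[label].update(text.lower().split())

        def spam_probability(self, text):
            logp = {}
            total_docs = sum(self.doc_counts.values())
            vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
            for label in ("spam", "ham"):
                # Class prior plus Laplace-smoothed word likelihoods.
                logp[label] = math.log(self.doc_counts[label] / total_docs)
                n_words = sum(self.word_counts[label].values())
                for w in text.lower().split():
                    logp[label] += math.log((self.word_counts[label][w] + 1) / (n_words + vocab))
            m = max(logp.values())
            num = math.exp(logp["spam"] - m)
            return num / (num + math.exp(logp["ham"] - m))

        def is_spam(self, text):
            return self.spam_probability(text) > self.threshold

    f = NaiveBayesSpamFilter(threshold=0.9)
    f.train("buy cheap pills now", "spam")
    f.train("cheap offer win money now", "spam")
    f.train("meeting agenda for monday", "ham")
    f.train("lunch on monday?", "ham")
    print(f.is_spam("win cheap pills"), f.is_spam("agenda for monday lunch"))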

Text filtering by boosting naive Bayes classifiers
Yu-Hwan Kim,
Shang-Yoon Hahn,
Byoung-Tak Zhang
Pages: 168-175
doi: 10.1145/345508.345572
Several machine learning algorithms have recently been used for text categorization and filtering. In particular, boosting methods such as AdaBoost have shown good performance applied to real text data. However, most of the existing boosting algorithms are based on classifiers that use binary-valued features. Thus, they do not fully make use of the weight information provided by standard term weighting methods. In this paper, we present a boosting-based learning method for text filtering that uses naive Bayes classifiers as a weak learner. The use of naive Bayes allows the boosting algorithm to utilize term frequency information while maintaining probabilistically accurate confidence ratios. Applied to TREC-7 and TREC-8 filtering track documents, the proposed method obtained a significant improvement in LF1, LF2, F1 and F3 measures compared to the best results submitted by other TREC entries.

Document filtering method using non-relevant information profile
Keiichiro Hoashi,
Kazunori Matsumoto,
Naomi Inoue,
Kazuo Hashimoto
Pages: 176-183
doi: 10.1145/345508.345573
Document filtering is a task to retrieve documents relevant to a user's profile from a flow of documents. Generally, filtering systems calculate the similarity between the profile and each incoming document, and retrieve documents with similarity higher than a threshold. However, many systems set a relatively high threshold to reduce retrieval of non-relevant documents, which results in many relevant documents being missed. In this paper, we propose the use of a non-relevant information profile to reduce the mistaken retrieval of non-relevant documents. Results from experiments show that this filter has successfully rejected a sufficient number of non-relevant documents, resulting in an improvement of filtering performance.
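A minimal sketch of the idea as described (the paper's exact scoring may differ): alongside the usual relevance profile, keep a profile built from known non-relevant documents and reject a document that is too close to it, even if it clears the relevance threshold. Vectors here are plain term-frequency dictionaries compared with cosine similarity:

    import math
    from collections import Counter

    def vec(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def accept(doc, relevant_profile, nonrelevant_profile, t_rel=0.2, t_nonrel=0.4):
        """Retrieve the document only if it is similar enough to the relevance profile
        and not too similar to the non-relevant profile."""
        d = vec(doc)
        return cosine(d, relevant_profile) >= t_rel and cosine(d, nonrelevant_profile) < t_nonrel

    relevant_profile = vec("olympic athletics sprint records gold medal")
    nonrelevant_profile = vec("olympic committee sponsorship budget scandal")
    print(accept("sprint gold medal won in the athletics final", relevant_profile, nonrelevant_profile))
    print(accept("olympic committee budget scandal widens", relevant_profile, nonrelevant_profile))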

Question-answering by predictive annotation
John Prager,
Eric Brown,
Anni Coden,
Dragomir Radev
Pages: 184-191
doi: 10.1145/345508.345574
We present a new technique for question answering called Predictive Annotation. Predictive Annotation identifies potential answers to questions in text, annotates them accordingly and indexes them. This technique, along with a complementary analysis of questions, passage-level ranking and answer selection, produces a system effective at answering natural-language fact-seeking questions posed against large document collections. Experimental results show the effects of different parameter settings and lead to a number of general observations about the question-answering problem.

Bridging the lexical chasm: statistical approaches to answer-finding
Adam Berger,
Rich Caruana,
David Cohn,
Dayne Freitag,
Vibhu Mittal
Pages: 192-199
doi: 10.1145/345508.345576
This paper investigates whether a machine can automatically learn the task of finding, within a large collection of candidate responses, the answers to questions. The learning process consists of inspecting a collection of answered questions and characterizing the relation between question and answer with a statistical model. For the purpose of learning this relation, we propose two sources of data: Usenet FAQ documents and customer service call-center dialogues from a large retail company. We will show that the task of “answer-finding” differs from both document retrieval and traditional question-answering, presenting challenges different from those found in these problems. The central aim of this work is to discover, through theoretical and empirical investigation, those statistical techniques best suited to the answer-finding problem.

Building a question answering test collection
Ellen M. Voorhees,
Dawn M. Tice
Pages: 200-207
doi: 10.1145/345508.345577
The TREC-8 Question Answering (QA) Track was the first large-scale evaluation of domain-independent question answering systems. In addition to fostering research on the QA task, the track was used to investigate whether the evaluation methodology used for document retrieval is appropriate for a different natural language processing task. As with document relevance judging, assessors had legitimate differences of opinions as to whether a response actually answers a question, but comparative evaluation of QA systems was stable despite these differences. Creating a reusable QA test collection is fundamentally more difficult than creating a document retrieval test collection since the QA task has no equivalent to document identifiers.

Document clustering using word clusters via the information bottleneck method
Noam Slonim,
Naftali Tishby
Pages: 208-215
doi: 10.1145/345508.345578
We present a novel implementation of the recently introduced information bottleneck method for unsupervised document clustering. Given a joint empirical distribution of words and documents, p(X, Y), we first cluster the words, Y, so that the obtained word clusters, Ỹ, maximally preserve the information on the documents. The resulting joint distribution, p(X, Ỹ), contains most of the original information about the documents, I(X; Ỹ) ≈ I(X; Y), but it is much less sparse and noisy. Using the same procedure we then cluster the documents, X, so that the information about the word-clusters is preserved. Thus, we first find word-clusters that capture most of the mutual information about the set of documents, and then find document clusters that preserve the information about the word clusters. We tested this procedure over several document collections based on subsets taken from the standard 20Newsgroups corpus. The results were assessed by calculating the correlation between the document clusters and the correct labels for these documents. Findings from our experiments show that this double clustering procedure, which uses the information bottleneck method, yields significantly superior performance compared to other common document distributional clustering algorithms. Moreover, the double clustering procedure improves all the distributional clustering methods examined here.
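The preservation criterion I(X; Ỹ) ≈ I(X; Y) is the core of the method; the sketch below computes the mutual information of a word-document joint distribution from a toy co-occurrence count matrix, which is the quantity the two clustering stages try to preserve:

    import math

    def mutual_information(counts):
        """I(X; Y) in bits from a matrix of co-occurrence counts
        (rows = documents X, columns = words Y)."""
        total = sum(sum(row) for row in counts)
        px = [sum(row) / total for row in counts]
        py = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
        mi = 0.0
        for i, row in enumerate(counts):
            for j, c in enumerate(row):
                if c:
                    pxy = c / total
                    mi += pxy * math.log2(pxy / (px[i] * py[j]))
        return mi

    # Toy counts: two documents dominated by the first two words, two by the last two.
    counts = [[8, 1, 0, 1],
              [7, 2, 1, 0],
              [0, 1, 9, 2],
              [1, 0, 8, 3]]
    print(round(mutual_information(counts), 3))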

Latent semantic space: iterative scaling improves precision of inter-document similarity measurement
Rie Kubota Ando
Pages: 216-223
doi: 10.1145/345508.345579
We present a novel algorithm that creates document vectors with reduced dimensionality. This work was motivated by an application characterizing relationships among documents in a collection. Our algorithm yielded inter-document similarities with an average precision up to 17.8% higher than that of singular value decomposition (SVD) used for Latent Semantic Indexing. The best performance was achieved with dimensional reduction rates that were 43% higher than SVD on average. Our algorithm creates basis vectors for a reduced space by iteratively “scaling” vectors and computing eigenvectors. Unlike SVD, it breaks the symmetry of documents and terms to capture information more evenly across documents. We also discuss correlation with a probabilistic model and evaluate a method for selecting the dimensionality using log-likelihood estimation.

An investigation of linguistic features and clustering algorithms for topical document clustering
Vasileios Hatzivassiloglou,
Luis Gravano,
Ankineedu Maganti
Pages: 224-231
doi: 10.1145/345508.345582
We investigate four hierarchical clustering methods (single-link, complete-link, groupwise-average, and single-pass) and two linguistically motivated text features (noun phrase heads and proper names) in the context of document clustering. A statistical model for combining similarity information from multiple sources is described and applied to DARPA's Topic Detection and Tracking phase 2 (TDT2) data. This model, based on log-linear regression, alleviates the need for extensive search in order to determine optimal weights for combining input features. Through an extensive series of experiments with more than 40,000 documents from multiple news sources and modalities, we establish that both the choice of clustering algorithm and the introduction of the additional features have an impact on clustering performance. We apply our optimal combination of features to the TDT2 test data, obtaining partitions of the documents that compare favorably with the results obtained by participants in the official TDT2 competition.

The impact of database selection on distributed searching
Allison L. Powell,
James C. French,
Jamie Callan,
Margaret Connell,
Charles L. Viles
Pages: 232-239
doi: 10.1145/345508.345584
The proliferation of online information resources increases the importance of effective and efficient distributed searching. Distributed searching is cast in three parts — database selection, query processing, and results merging. In this paper we examine the effect of database selection on retrieval performance. We look at retrieval performance in three different distributed retrieval testbeds and distill some general results. First we find that good database selection can result in better retrieval effectiveness than can be achieved in a centralized database. Second we find that good performance can be achieved when only a few sites are selected and that the performance generally increases as more sites are selected. Finally we find that when database selection is employed, it is not necessary to maintain collection wide information (CWI), e.g. global idf. Local information can be used to achieve superior performance. This means that distributed systems can be engineered with more autonomy and less cooperation. This work suggests that improvements in database selection can lead to broader improvements in retrieval performance, even in centralized (i.e. single database) systems. Given a centralized database and a good selection mechanism, retrieval performance can be improved by decomposing that database conceptually and employing a selection step.

Hill climbing algorithms for content-based retrieval of similar configurations
Dimitris Papadias
Pages: 240-247
doi: 10.1145/345508.345587
The retrieval of stored images matching an input configuration is an important form of content-based retrieval. Exhaustive processing (i.e., retrieval of the best solutions) of configuration similarity queries is, in general, exponential and fast search for sub-optimal solutions is the only way to deal with the vast (and ever increasing) amounts of multimedia information in several real-time applications. In this paper we discuss the utilization of hill climbing heuristics that can provide very good results within limited processing time. We propose several heuristics, which differ on the way that they search through the solution space, and identify the best ones depending on the query and image characteristics. Finally we develop new algorithms that take advantage of the specific structure of the problem to improve performance.

Partial collection replication versus caching for information retrieval systems
Zhihong Lu,
Kathryn S. McKinley
Pages: 248-255
doi: 10.1145/345508.345591
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms to attain timely and accurate retrieval of unstructured text. In this paper, we compare two mechanisms to improve IR system performance: partial collection replication and caching. When queries have locality, both mechanisms return results more quickly than sending queries to the original collection(s). Caches return results when queries exactly match a previous one. Partial replicas are a form of caching that return results when the IR technology determines the query is a good match. Caches are simpler and faster, but replicas can increase locality by detecting similarity between queries that are not exactly the same. We use real traces from THOMAS and Excite to measure query locality and similarity. With a very restrictive definition of query similarity, similarity improves query locality up to 15% over exact match. We use a validated simulator to compare their performance, and find that even if the partial replica hit rate increases only 3 to 6%, it will outperform simple caching under a variety of configurations. A combined approach will probably yield the best performance.
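A minimal sketch of the distinction the abstract draws (class names and the similarity threshold are illustrative): an exact-match cache answers only repeated queries, whereas a partial replica is consulted whenever the query is judged similar enough to what the replica covers, so near-duplicate queries can also be served:

    def jaccard(q1, q2):
        a, b = set(q1.lower().split()), set(q2.lower().split())
        return len(a & b) / len(a | b) if a | b else 0.0

    class ExactMatchCache:
        def __init__(self):
            self.store = {}
        def get(self, query):
            return self.store.get(query.lower())
        def put(self, query, results):
            self.store[query.lower()] = results

    class PartialReplica:
        """Answers a query if it is similar enough to one the replica already covers."""
        def __init__(self, threshold=0.5):
            self.threshold = threshold
            self.covered = {}                 # query -> results held in the replica
        def get(self, query):
            best = max(self.covered, key=lambda q: jaccard(q, query), default=None)
            if best is not None and jaccard(best, query) >= self.threshold:
                return self.covered[best]
            return None
        def put(self, query, results):
            self.covered[query] = results

    cache, replica = ExactMatchCache(), PartialReplica()
    cache.put("clean air act", ["doc12", "doc98"])
    replica.put("clean air act", ["doc12", "doc98"])
    print(cache.get("clean air act amendments"))    # miss: not an exact repeat
    print(replica.get("clean air act amendments"))  # hit: similar enough to a covered query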

Hierarchical classification of Web content
Susan Dumais,
Hao Chen
Pages: 256-263
doi: 10.1145/345508.345593
This paper explores the use of hierarchical structure for classifying a large, heterogeneous collection of web content. The hierarchical structure is initially used to train different second-level classifiers. In the hierarchical case, a model is learned to distinguish a second-level category from other categories within the same top level. In the flat non-hierarchical case, a model distinguishes a second-level category from all other second-level categories. Scoring rules can further take advantage of the hierarchy by considering only second-level categories that exceed a threshold at the top level.
We use support vector machine (SVM) classifiers, which have been shown to be efficient and effective for classification, but not previously explored in the context of hierarchical classification. We found small advantages in accuracy for hierarchical models over flat models. For the hierarchical approach, we found the same accuracy using a sequential Boolean decision rule and a multiplicative decision rule. Since the sequential approach is much more efficient, requiring only 14%-16% of the comparisons used in the other approaches, we find it to be a good choice for classifying text into large hierarchical structures.
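A minimal sketch of the sequential (Boolean) decision rule described above: second-level classifiers are consulted only for top-level categories whose own score already exceeds its threshold, which is where most of the comparisons are saved. The toy keyword scorers stand in for trained SVM decision functions:

    def classify_hierarchical(doc, top_level, second_level, top_threshold=0.5, second_threshold=0.5):
        """top_level: {category: scorer}; second_level: {category: {subcategory: scorer}}.
        A scorer is any callable mapping a document to a score in [0, 1].
        Only subtrees whose top-level score passes the threshold are examined."""
        labels = []
        for cat, scorer in top_level.items():
            if scorer(doc) < top_threshold:
                continue                      # prune the whole subtree
            for sub, sub_scorer in second_level[cat].items():
                if sub_scorer(doc) >= second_threshold:
                    labels.append((cat, sub))
        return labels

    def contains(*words):
        return lambda doc: 1.0 if any(w in doc.lower() for w in words) else 0.0

    top = {"Sports": contains("match", "league"), "Computers": contains("software", "chip")}
    second = {"Sports": {"Soccer": contains("goal"), "Tennis": contains("serve")},
              "Computers": {"Hardware": contains("chip"), "Software": contains("software")}}
    print(classify_hierarchical("new chip design speeds up software builds", top, second))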

A practical hypertext categorization method using links and incrementally available class information
Hyo-Jung Oh,
Sung Hyon Myaeng,
Mann-Ho Lee
Pages: 264-271
doi: 10.1145/345508.345594
As the WWW grows at an increasing speed, a classifier targeted at hypertext is in high demand. While document categorization is quite a mature area, the issue of utilizing hypertext structure and hyperlinks has been relatively unexplored. In this paper, we propose a practical method for enhancing both the speed and the quality of hypertext categorization using hyperlinks. In comparison against a recently proposed technique that appears to be the only one of the kind, we obtained up to 18.5% improvement in effectiveness while reducing the processing time dramatically. We attempt to explain through experiments what factors contribute to the improvement.

Topical locality in the Web
Brian D. Davison
Pages: 272-279
doi: 10.1145/345508.345597
Most web pages are linked to others with related content. This idea, combined with another that says that text in, and possibly around, HTML anchors describes the pages to which they point, is the foundation for a usable World-Wide Web. In this paper, we examine to what extent these ideas hold by empirically testing whether topical locality mirrors spatial locality of pages on the Web. In particular, we find the likelihood of linked pages having similar textual content to be high; the similarity of sibling pages increases when the links from the parent are close together; titles, descriptions, and anchor text represent at least part of the target page; and anchor text may be a useful discriminator among unseen child pages. These results show the foundations necessary for the success of many web systems, including search engines, focused crawlers, linkage analyzers, and intelligent web agents.

Interactive Internet search: keyword, directory and query reformulation mechanisms compared
Peter Bruza,
Robert McArthur,
Simon Dennis
Pages: 280-287
doi: 10.1145/345508.345598
This article compares search effectiveness when using query-based Internet search (via the Google search engine), directory-based search (via Yahoo) and phrase-based query reformulation assisted search (via the Hyperindex browser) by means of a controlled, user-based experimental study. The focus was to evaluate aspects of the search process. Cognitive load was measured using a secondary digit-monitoring task to quantify the effort of the user in various search states; independent relevance judgements were employed to gauge the quality of the documents accessed during the search process. Time was monitored in various search states. Results indicated that directory-based search does not offer increased relevance over query-based search (with or without query reformulation assistance), and also takes longer. Query reformulation does significantly improve the relevance of the documents through which the user must trawl versus standard query-based Internet search. However, the improvement in document relevance comes at the cost of increased search time and increased cognitive load.

Incorporating quality metrics in centralized/distributed information retrieval on the World Wide Web
Xiaolan Zhu,
Susan Gauch
Pages: 288-295
doi: 10.1145/345508.345602
Most information retrieval systems on the Internet rely primarily on similarity ranking algorithms based solely on term frequency statistics. Information quality is usually ignored. This leads to the problem that documents are retrieved without regard to their quality. We present an approach that combines similarity-based ranking with quality ranking in centralized and distributed search environments. Six quality metrics, including the currency, availability, information-to-noise ratio, authority, popularity, and cohesiveness, were investigated. Search effectiveness was significantly improved when the currency, availability, information-to-noise ratio and page cohesiveness metrics were incorporated in centralized search. The improvement seen when the availability, information-to-noise ratio, popularity, and cohesiveness metrics were incorporated in site selection was also significant. Finally, incorporating the popularity metric in information fusion resulted in a significant improvement. In summary, the results show that incorporating quality metrics can generally improve search effectiveness in both centralized and distributed search environments.
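One straightforward way to fold quality metrics into a similarity ranking is a weighted linear combination of normalized scores; the weights and the combination rule below are assumptions for illustration, not the ones used in the paper:

    def combined_score(similarity, quality, weights, similarity_weight=0.6):
        """similarity: content-based score in [0, 1]; quality: metric -> value in [0, 1];
        weights: metric -> relative weight (the quality weights should sum to 1)."""
        quality_score = sum(weights[m] * quality[m] for m in weights)
        return similarity_weight * similarity + (1.0 - similarity_weight) * quality_score

    quality = {"currency": 0.9, "availability": 1.0, "info_to_noise": 0.7, "cohesiveness": 0.6}
    weights = {"currency": 0.3, "availability": 0.2, "info_to_noise": 0.3, "cohesiveness": 0.2}
    print(round(combined_score(similarity=0.42, quality=quality, weights=weights), 3))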

Does “authority” mean quality? predicting expert quality ratings of Web documents
Brian Amento,
Loren Terveen,
Will Hill
Pages: 296-303
doi: 10.1145/345508.345603
For many topics, the World Wide Web contains hundreds or thousands of relevant documents of widely varying quality. Users face a daunting challenge in identifying a small subset of documents worthy of their attention.
Link analysis algorithms have received much interest recently, in large part for their potential to identify high quality items. We report here on an experimental evaluation of this potential.
We evaluated a number of link and content-based algorithms using a dataset of web documents rated for quality by human topic experts. Link-based metrics did a good job of picking out high-quality items. Precision at 5 is about 0.75, and precision at 10 is about 0.55; this is in a dataset where 0.32 of all documents were of high quality. Surprisingly, a simple content-based metric performed nearly as well: ranking documents by the total number of pages on their containing site.
expand
|
|
|
Document classification on neural networks using only positive examples (poster session) |
| |
Larry M. Manevitz,
Malik Yousef
|
|
Pages: 304-306 |
|
doi>10.1145/345508.345608 |
|
Full text: PDF
|
|
In this paper, we show how a simple feed-forward neural network can be trained to filter documents when only positive information is available, and that this method seems to be superior to more standard methods, such as tf-idf retrieval based on an “average vector”. A novel experimental finding that retrieval is enhanced substantially in this context by carrying out a certain kind of uniform transformation (“Hadamard”) of the information prior to the training of the network.
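The abstract does not spell out the transform variant or the network architecture; purely as a hedged illustration of the preprocessing step, the sketch below applies an orthonormal Walsh-Hadamard transform to a zero-padded document feature vector before it would be handed to a network.

```python
# Hedged illustration of the preprocessing idea: apply a Hadamard (Walsh)
# transform to each document's feature vector before training a filter.
# The specific transform variant, normalization, and network used by the
# authors are not given in the abstract; this is only a sketch.

import numpy as np
from scipy.linalg import hadamard

def hadamard_transform(features: np.ndarray) -> np.ndarray:
    """Pad a tf-idf style vector to the next power of two and multiply by
    the Hadamard matrix (scaled so the transform is orthonormal)."""
    n = 1
    while n < features.size:
        n *= 2
    padded = np.zeros(n)
    padded[:features.size] = features
    H = hadamard(n) / np.sqrt(n)
    return H @ padded

if __name__ == "__main__":
    doc_vector = np.array([0.2, 0.0, 0.7, 0.1, 0.0])   # toy tf-idf weights
    print(hadamard_transform(doc_vector))
```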
|
|
|
New paradigms in information visualization (poster session) |
| |
Peter Au,
Matthew Carey,
Shalini Sewraz,
Yike Guo,
Stefan M. Rüger
|
|
Pages: 307-309 |
|
doi>10.1145/345508.345610 |
|
Full text: PDF
|
|
We present three new visualization front-ends that aid navigation through the set of documents returned by a search engine (hit documents). We cluster the hit documents to visually group these documents and label the groups with related words. The different front-ends cater for different user needs, but all allow browsing cluster information, drilling up or down in one or more clusters, and refining the search using one or more of the suggested related keywords.
|
|
|
Latent semantic indexing model for Boolean query formulation (poster session) |
| |
Dae-Ho Baek,
HeuiSeok Lim,
Hae-Chang Rim
|
|
Pages: 310-312 |
|
doi>10.1145/345508.345612 |
|
Full text: PDF
|
|
A new model, named the Boolean Latent Semantic Indexing model, based on Singular Value Decomposition and Boolean query formulation, is introduced. While the Singular Value Decomposition alleviates the problems of lexical matching in the traditional information retrieval model, Boolean query formulation can help users make a precise representation of their information search needs. Retrieval experiments on a number of test collections seem to show that the proposed model achieves substantial performance gains over the Latent Semantic Indexing model.
|
|
|
Generation of user profiles for information filtering — research agenda (poster session) |
| |
Tsvi Kuflik,
Peretz Shoval
|
|
Pages: 313-315 |
|
doi>10.1145/345508.345615 |
|
Full text: PDF
|
|
In information filtering (IF) systems, users' long-term needs are expressed as user profiles. The quality of a user profile has a major impact on the performance of IF systems. The focus of the proposed research is on the study of user profile generation and update. The paper introduces methods for user profile generation, and proposes a research agenda for their comparison and evaluation.
|
|
|
Variance based classifier comparison in text categorization (poster session) |
| |
Atsuhiro Takasu,
Kenro Aihara
|
|
Pages: 316-317 |
|
doi>10.1145/345508.345618 |
|
Full text: PDF
|
|
Text categorization is one of the key functions for utilizing vast amounts of documents. It can be seen as a classification problem, which has been studied in the pattern recognition and machine learning fields for a long time, and several classification methods have been developed, such as statistical classification, decision trees, support vector machines and so on. Many researchers have applied those classification methods to text categorization and reported their performance (e.g., decision tree [3], Bayes classifier [2], support vector machine [1]). Yang conducted a comprehensive comparative study of text categorization methods and reported that k-nearest-neighbor and support vector machine classifiers work well for text categorization [4].
In previous studies, classification methods were usually compared using a single pair of training and test data. However, a classification method with a more complex family of classifiers requires more training data, and small training data may result in an unreliable classifier, that is, one whose performance varies greatly depending on the training data. Therefore, we need to take the size of the training data into account when comparing and selecting a classification method. In this paper, we discuss how to select a classifier from those derived by various classification methods and how the size of the training data affects the performance of the derived classifier.
In order to evaluate the reliability of a classification method, we consider the variance of the accuracy of the derived classifier. We first construct a statistical model. In text categorization, each document is usually represented by a feature vector that consists of weighted term frequencies. In the vector space model, a document is a point in a high-dimensional feature space, and a classifier separates the feature space into subspaces, each of which is labeled with a category.
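A hedged sketch of the sort of comparison at issue: estimating the mean and variance of a classifier's accuracy over repeated random draws of a training set of a fixed size. The nearest-centroid classifier and the sampling scheme below are toy stand-ins, not the authors' statistical model.

```python
# Illustration only: how accuracy variance over random training samples of a
# given size can be estimated for a simple classifier.

import random
import statistics

def train_centroids(examples):
    """examples: list of (feature_tuple, label). Returns per-class centroids."""
    sums, counts = {}, {}
    for x, y in examples:
        sums.setdefault(y, [0.0] * len(x))
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + v for s, v in zip(sums[y], x)]
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def classify(x, centroids):
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda y: dist(x, centroids[y]))

def accuracy_variance(data, train_size, trials=30):
    """Mean and variance of held-out accuracy over repeated random splits."""
    accs = []
    for _ in range(trials):
        random.shuffle(data)
        train, test = data[:train_size], data[train_size:]
        c = train_centroids(train)
        accs.append(sum(classify(x, c) == y for x, y in test) / len(test))
    return statistics.mean(accs), statistics.variance(accs)

# e.g. accuracy_variance(labelled_examples, train_size=20)
```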
|
|
|
The use of phrases from query texts in information retrieval (poster session) |
| |
Masumi Narita,
Yasushi Ogawa
|
|
Pages: 318-320 |
|
doi>10.1145/345508.345621 |
|
Full text: PDF
|
|
|
|
|
Pseudo-frequency method (poster session): an efficient document ranking retrieval method for n-gram indexing |
| |
Yasushi Ogawa
|
|
Pages: 321-323 |
|
doi>10.1145/345508.345622 |
|
Although n-gram (n successive characters) indexing is widely used in retrieval systems for documents in Japanese and other Asian languages, it is difficult to process ranking retrieval efficiently using n-gram indexing. This is because frequency information for query words needs to be computed using indexed data since this information is not directly available from the n-gram index. To reduce processing costs, this paper proposes a pseudo-frequency method, which uses a word's estimated frequencies instead of precise ones. The results of experiments on NTCIR, a Japanese IR test collection, showed that the proposed method speeded up retrieval without degrading retrieval effectiveness.
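The abstract does not give the estimator itself; purely as an illustration of the general idea, the sketch below estimates a query word's within-document frequency from the postings of its constituent bigrams (here, by taking the minimum constituent-bigram frequency) rather than verifying exact adjacency.

```python
# Hedged sketch: estimate a query word's within-document frequency from the
# postings of its constituent character bigrams instead of computing the
# exact count via positional joins. The minimum-frequency estimator is only
# an illustration; the paper's actual estimate may differ.

def bigrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

def pseudo_frequency(word, doc_id, bigram_index):
    """bigram_index: dict mapping bigram -> {doc_id: frequency}."""
    freqs = [bigram_index.get(bg, {}).get(doc_id, 0) for bg in bigrams(word)]
    return min(freqs) if freqs else 0

# Toy index for one document (doc 7): the word "東京都" yields bigrams 東京, 京都.
index = {"東京": {7: 3}, "京都": {7: 2}}
print(pseudo_frequency("東京都", 7, index))   # estimated frequency 2
```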
|
|
|
Lexical semantic relatedness and online new event detection (poster session) |
| |
Nicola Stokes,
Paula Hatch,
Joe Carthy
|
|
Pages: 324-325 |
|
doi>10.1145/345508.345623 |
|
Full text: PDF
|
|
|
|
|
Modeling question-response patterns by scaling and visualization (poster session) |
| |
Mark Rorvig
|
|
Pages: 326-327 |
|
doi>10.1145/345508.345624 |
|
Full text: PDF
|
|
The evaluation of question difficulty is usually considered the domain of Latent Trait Theory. However, these methods require standardized question sets normalized by large populations, rendering them inefficient for use in the numerous areas where questions must be evaluated. A new technique is illustrated that models the question-response cycle well, but without the procedural difficulty of the traditional methods.
|
|
|
The effect of query type on subject searching behavior of image databases (poster session): an exploratory study |
| |
Efthimis N. Efthimiadis,
Raya Fidel
|
|
Pages: 328-330 |
|
doi>10.1145/345508.345625 |
|
Full text: PDF
|
|
|
|
|
The role of a judge in a user based retrieval experiment (poster session) |
| |
Mingfang Wu,
Michael Fuller,
Ross Wilkinson
|
|
Pages: 331-333 |
|
doi>10.1145/345508.345628 |
|
Full text: PDF
|
|
|
|
|
Auto-construction of a live thesaurus from search term logs for interactive Web search (poster session) |
| |
Shui-Lung Chuang,
Hsiao-Tieh Pu,
Wen-Hsiang Lu,
Lee-Feng Chien
|
|
Pages: 334-336 |
|
doi>10.1145/345508.345630 |
|
The purpose of this paper is to present on-going research intended to construct a live thesaurus directly from the search term logs of real-world search engines. Such a thesaurus can contain representative search terms, their frequency of use, the corresponding subject categories, the associated and relevant terms, and the most frequently visited Web sites/pages the search terms may reach.
|
|
|
Cognitive approach for building user model in an information retrieval context (poster session) |
| |
Amina Sayeb Belhassen,
Nabil Ben Abdallah,
Henda Hadjami Ben Ghezala
|
|
Pages: 337-338 |
|
doi>10.1145/345508.345632 |
|
Full text: PDF
|
|
The recent development of communication networks and multimedia systems provides users with a huge amount of information, worsening the problem of information overload [9]. System design must therefore evolve to become more user-centred and more personally involving. A review of survey studies of Internet users since 1993 confirms that a greater percentage of people are becoming online citizens, and professionals are integrating more online components into their work processes. A review of the experimental literature on Internet users reveals that there is intense interest in humanising the online environment by integrating affective and cognitive components [8].
We are especially concerned with the effects of this evolution on information retrieval. Significant changes can be observed in the information retrieval world over the past five or so years, due to the emergence of the Internet and one of its most important and widely used services, the World Wide Web (WWW), or simply the Web.
Reviewing the progress of research in information retrieval and user modelling, we can observe that many systems and prototypes have been created [5], [10], [11], but all of them share some basic limitations: the techniques used to represent knowledge in the user model are based on simple lists of keywords; the type of knowledge considered is very limited, usually restricted to single words or to (some) structural characteristics; and the learning capabilities are very poor.
We aim to propose a cognitive approach to building a user model in an information retrieval context. The cognitive approach is based on identifying how users process information and what constitutes an appropriate model to represent this process; IR under the cognitive paradigm takes the user into account in a high-priority way [1]. However, within the cognitive paradigm there is no general model, valid for our documentary approach, that satisfactorily describes how user knowledge is represented for the purpose of processing information. The lack of such a model does not allow one to identify a user's cognitive state with regard to his or her information needs and requirements.
Methodologies adopting the cognitive viewpoint in IR are synthesised by Daniel [4] into three groups, which comprise the representation of:
users and their problems, which stems from the hypothesis proposed by Belkin on the `anomalous states of knowledge' (ASK), according to which the user searches for information to resolve an anomaly in his or her state of knowledge;
search strategies, which compile the different ways search strategies and processes are carried out, depending on the variables involved (user, intermediary, IR system) [6], [7];
documents and information, which is considered a major goal of current IR research, since it embraces the whole corpus of studies about user models intended to eliminate the intermediary's role in the retrieval system. The aim of this approach is to allow users direct access to the system by means of the representation of documents and intelligent interfaces.
The user-centred paradigm now dominates studies of information needs and information retrieval. Our goal is to develop new approaches to information retrieval based on user modelling techniques for building and managing the representation of user preferences. In this paper, we describe two complementary approaches that are necessary for building a user model and integrating it into an information retrieval system:
a conceptual approach, based on the description of the knowledge needed by the user in an information retrieval context;
a functional approach, which deals with the dynamic aspects of the model. Within this approach we aim to determine the role played by the model in an information retrieval context.
In many studies of IR interested in user modelling we find different kinds of knowledge used to describe the user's needs. Our conceptual approach has therefore consisted in enumerating these kinds of knowledge and integrating them into the appropriate components of an information retrieval architecture. Most of the studies that identified cognitive characteristics used quantitative methods to measure them. What is needed is a qualitative study and an appropriate method to ascertain these cognitive characteristics [12].
Our main objective is the development of techniques for modelling the user as an interactive part of IR, so we propose a functional approach that deals with identifying cognitive characteristics within the role played by the user model in an information retrieval architecture. We begin by presenting the conceptual approach, followed by the functional one.
|
|
|
Multimedia information retrieval from recorded presentations (poster session) |
| |
Wolfgang Hürst,
Rainer Müller,
Christoph Mayer
|
|
Pages: 339-341 |
|
doi>10.1145/345508.345636 |
|
Full text: PDF
|
|
In presentation recording, special effort is usually put into the automation of the production process, that is, into automatically creating high-quality data files without much or any need for manual recording and post-editing [5]. With the advent of such systems and their usage in classroom teaching, at conferences, etc., there is an increasing need for techniques that enable users to search within those documents and to locate specific information. In this paper we describe how we integrated information retrieval techniques into the Authoring on the Fly (AOF) system, an approach for automatic presentation recording. We have chosen the AOF system for two reasons. On the one hand, it is a well-established way of presentation recording, used by various universities and institutions. On the other hand, it is general enough to illustrate the typical problems and challenges a developer faces when designing a system for information retrieval from the multimedia data streams which occur in the presentation recording scenario.
|
|
|
Influence of speech recognition errors on topic detection (poster session) |
| |
J. Scott McCarley,
Martin Franz
|
|
Pages: 342-344 |
|
doi>10.1145/345508.345638 |
|
Full text: PDF
|
|
We investigate the effect of speech-recognition errors on a system for the unsupervised, nearly synchronous clustering of broadcast news stories, using the TDT (Topic Detection and Tracking) Corpora. Two questions are addressed: (1) Are speech recognition errors detrimental to the performance of the system? (2) Can a background collection of contemporaneous clean text improve performance? We investigate both the large-cluster and small-cluster limits.
|
|
|
Word document density and relevance scoring (poster session) |
| |
Martin Franz,
J. Scott McCarley
|
|
Pages: 345-347 |
|
doi>10.1145/345508.345641 |
|
Full text: PDF
|
|
Previous work addressing the issue of word distribution in documents has shown the importance of word repetitiveness as an indicator of a word's content-bearing characteristics. In this paper we propose a simple method that uses a measure of the tendency of words to repeat within a document to separate words with similar document frequencies but different topic-discriminating characteristics. We describe the application of the new measure in query-document relevance scoring. Experiments on the TREC Ad Hoc and Spoken Document Retrieval tasks [7] show useful performance improvements.
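As a hedged illustration of a "tendency to repeat" measure (the paper's precise definition is not given in the abstract), one can compute the average number of occurrences per containing document, i.e. collection frequency divided by document frequency:

```python
# Illustration only: content-bearing words tend to repeat within the
# documents they appear in, so the mean term frequency over containing
# documents separates words with similar document frequencies.

from collections import Counter

def repetitiveness(term, documents):
    """documents: list of token lists."""
    tfs = [Counter(doc)[term] for doc in documents]
    containing = [tf for tf in tfs if tf > 0]
    if not containing:
        return 0.0
    return sum(containing) / len(containing)   # mean tf over docs containing term

docs = [["oil", "price", "oil", "oil"], ["price", "said"], ["oil", "field", "oil"]]
print(repetitiveness("oil", docs))    # 2.5: repeats where it occurs
print(repetitiveness("said", docs))   # 1.0: occurs but does not repeat
```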
|
|
|
Ranking digital images using combination of evidences (poster session) |
| |
Iadh Ounis
|
|
Pages: 348-350 |
|
doi>10.1145/345508.345643 |
|
Full text: PDF
|
|
|
|
|
Collaborative filtering and the generalized vector space model (poster session) |
| |
Ian Soboroff,
Charles Nicholas
|
|
Pages: 351-353 |
|
doi>10.1145/345508.345646 |
|
Full text: PDF
|
|
Collaborative filtering is a technique for recommending documents to users based on how similar their tastes are to other users. If two users tend to agree on what they like, the system will recommend the same documents to them. The generalized vector space model of information retrieval represents a document by a vector of its similarities to all other documents. The process of collaborative filtering is nearly identical to the process of retrieval using GVSM in a matrix of user ratings. Using this observation, a model for filtering collaboratively using document content is possible.
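A minimal sketch of the analogy, assuming a toy user-by-document ratings matrix: each user's row plays the role of a GVSM-style representation, and a missing rating is predicted from similarity-weighted ratings of other users. The cosine weighting and smoothing constant are illustrative choices, not the paper's exact formulation.

```python
# Illustration only: treating the user-ratings matrix like a GVSM
# term-document matrix and predicting interest in an unseen document from
# the ratings of similar users.

import numpy as np

# rows = users, columns = documents; 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def predict(user, doc, R):
    """Similarity-weighted average of other users' ratings for doc."""
    norms = np.linalg.norm(R, axis=1)
    sims = R @ R[user] / (norms * norms[user] + 1e-9)   # cosine over rating rows
    others = [u for u in range(R.shape[0]) if u != user and R[u, doc] > 0]
    if not others:
        return 0.0
    weights = np.array([sims[u] for u in others])
    values = np.array([R[u, doc] for u in others])
    return float(weights @ values / (weights.sum() + 1e-9))

print(round(predict(0, 2, ratings), 2))   # user 0's predicted rating for doc 2
```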
|
|
|
Theme-based retrieval of Web news (poster session) |
| |
Nuno Maria,
Mário J. Silva
|
|
Pages: 354-356 |
|
doi>10.1145/345508.345648 |
|
Full text: PDF
|
|
We present our framework for classification of Web news, based on support vector machines, and some of the initial measurements of its accuracy.
|
|
|
Stemming and its effects on TFIDF ranking (poster session) |
| |
Mark Kantrowitz,
Behrang Mohit,
Vibhu Mittal
|
|
Pages: 357-359 |
|
doi>10.1145/345508.345650 |
|
Full text: PDF
|
|
|
|
|
Exploration of a heuristic approach to threshold learning in adaptive filtering (poster session) |
| |
Chengxiang Zhai,
Peter Jansen,
David A. Evans
|
|
Pages: 360-362 |
|
doi>10.1145/345508.345652 |
|
Full text: PDF
|
|
In this paper we examine the learning behavior of a heuristic threshold setting approach to information filtering. In particular, we study how different initial threshold settings and different updating parameter settings affect threshold learning. The results on one of the TREC news databases indicate that (1) learning allows recovery from the inevitable non-optimality of the initial conditions, and (2) a greater “willingness to learn” (expressed by a deliberate lowering of the score threshold in the learning stage) does eventually lead to a higher performance in spite of the expected initial performance penalty.
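As a hedged sketch of such a heuristic (the abstract describes the strategy of deliberately starting with a low threshold but not its exact update rule), the filter below raises its threshold when observed precision falls short of a target and lowers it otherwise; the parameter names and step sizes are assumptions for illustration.

```python
# Illustration only: heuristic threshold updating in adaptive filtering.

class ThresholdFilter:
    def __init__(self, initial_threshold=0.3, step=0.02, target_precision=0.5):
        self.threshold = initial_threshold      # start low: "willingness to learn"
        self.step = step                        # hypothetical update increment
        self.target_precision = target_precision
        self.delivered = 0
        self.relevant = 0

    def deliver(self, score):
        return score >= self.threshold

    def feedback(self, was_relevant):
        """Raise the threshold when precision falls short, lower it otherwise."""
        self.delivered += 1
        self.relevant += int(was_relevant)
        precision = self.relevant / self.delivered
        if precision < self.target_precision:
            self.threshold += self.step
        else:
            self.threshold -= self.step / 2

f = ThresholdFilter()
for score, rel in [(0.35, False), (0.4, True), (0.32, False)]:
    if f.deliver(score):
        f.feedback(rel)
print(round(f.threshold, 3))
```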
|
|
|
On the design and evaluation of a multi-dimensional approach to information retrieval (poster session) |
| |
M. Catherine McCabe,
Jinho Lee,
Abdur Chowdhury,
David Grossman,
Ophir Frieder
|
|
Pages: 363-365 |
|
doi>10.1145/345508.345656 |
|
Full text: PDF
|
|
We present a method of searching text collections that takes advantage of hierarchical information within documents and integrates searches of structured and unstructured data. We show that multidimensional databases (MDB), designed for accessing data along hierarchical dimensions, are effective for information retrieval. We demonstrate a method of using On-Line Analytic Processing (OLAP) techniques on a text collection. This combines traditional information retrieval with the slicing, dicing, drill-down, and roll-up of OLAP. We demonstrate the use of a prototype for searching documents from the TREC collection.
|
|
|
SWAMI (poster session): a framework for collaborative filtering algorithm development and evaluation |
| |
Danyel Fisher,
Kris Hildrum,
Jason Hong,
Mark Newman,
Megan Thomas,
Rich Vuduc
|
|
Pages: 366-368 |
|
doi>10.1145/345508.345658 |
|
Full text: PDF
|
|
We present a Java-based framework, SWAMI (Shared Wisdom through the Amalgamation of Many Interpretations), for building and studying collaborative filtering systems. SWAMI consists of three components: a prediction engine, an evaluation system, and a visualization component. The prediction engine provides a common interface for implementing different prediction algorithms. The evaluation system provides a standardized testing methodology and metrics for analyzing the accuracy and run-time performance of prediction algorithms. The visualization component suggests how graphical representations can inform the development and analysis of prediction algorithms. We demonstrate SWAMI on the EachMovie data set by comparing three prediction algorithms: a traditional Pearson correlation-based method, support vector machines, and a new accurate and scalable correlation-based method based on clustering techniques.
|
|
|
Learning probabilistic models of the Web (poster session) |
| |
Thomas Hofmann
|
|
Pages: 369-371 |
|
doi>10.1145/345508.345660 |
|
Full text: PDF
|
|
In the World Wide Web, myriads of hyperlinks connect documents and pages to create an unprecedented, highly complex graph structure - the Web graph. This paper presents a novel approach to learning probabilistic models of the Web, which can be used to make reliable predictions about connectivity and information content of Web documents. The proposed method is a probabilistic dimension reduction technique which recasts and unites Latent Semantic Analysis and Kleinberg's Hubs-and-Authorities algorithm in a statistical setting.
This is meant to be a first step towards the development of a statistical foundation for Web-related information technologies. Although this paper does not focus on a particular application, a variety of algorithms operating in the Web/Internet environment can take advantage of the presented techniques, including search engines, Web crawlers, and information agent systems.
|
|
|
Effects of out of vocabulary words in spoken document retrieval (poster session) |
| |
P. C. Woodland,
S. E. Johnson,
P. Jourlin,
K. Spärck Jones
|
|
Pages: 372-374 |
|
doi>10.1145/345508.345661 |
|
Full text: PDF
|
|
The effects of out-of-vocabulary (OOV) items in spoken document retrieval (SDR) are investigated. Several sets of transcriptions were created for the TREC-8 SDR task using a speech recognition system varying the vocabulary sizes and OOV rates, and the relative retrieval performance measured. The effects of OOV terms on a simple baseline IR system and on more sophisticated retrieval systems are described. The use of a parallel corpus for query and document expansion is found to be especially beneficial, and with this data set, good retrieval performance can be achieved even for fairly high OOV rates.
|
|
|
Towards an adaptive and task-specific ranking mechanism in Web searching (poster session) |
| |
Chen Ding,
Chi-Hung Chi
|
|
Pages: 375-376 |
|
doi>10.1145/345508.345663 |
|
Full text: PDF
|
|
|
|
|
Beyond the traditional query operators (poster session) |
| |
Chen Ding,
Chi-Hung Chi
|
|
Pages: 377-378 |
|
doi>10.1145/345508.345664 |
|
Full text: PDF
|
|
|
|
|
Bayes optimal metasearch: a probabilistic model for combining the results of multiple retrieval systems (poster session) |
| |
Javed A. Aslam,
Mark Montague
|
|
Pages: 379-381 |
|
doi>10.1145/345508.345665 |
|
Full text: PDF
|
|
We introduce a new, probabilistic model for combining the outputs of an arbitrary number of query retrieval systems. By gathering simple statistics on the average performance of a given set of query retrieval systems, we construct a Bayes optimal mechanism for combining the outputs of these systems. Our construction yields a metasearch strategy whose empirical performance nearly always exceeds the performance of any of the constituent systems. Our construction is also robust in the sense that if “good” and “bad” systems are combined, the performance of the composite is still on par with, or exceeds, that of the best constituent system. Finally, our model and theory provide theoretical and empirical avenues for the improvement of this metasearch strategy.
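A hedged sketch of the rank-combination idea: under an independence assumption, a document's combined score can be taken as the sum over systems of the log-likelihood ratio of its rank (bucket) given relevance versus non-relevance. The rank-likelihood tables below are hypothetical toy numbers; in practice they would be estimated from training queries with relevance judgements.

```python
# Illustration only of likelihood-ratio rank combination for metasearch.

import math

# P(rank bucket | relevant) and P(rank bucket | non-relevant) per system,
# indexed by bucket 0 (top of ranking) .. 2 (bottom) -- toy numbers.
p_rank_given_rel    = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
p_rank_given_nonrel = [[0.2, 0.3, 0.5], [0.3, 0.3, 0.4]]

def metasearch_score(rank_buckets):
    """rank_buckets[i] is the document's rank bucket in system i."""
    return sum(
        math.log(p_rank_given_rel[i][b] / p_rank_given_nonrel[i][b])
        for i, b in enumerate(rank_buckets)
    )

# A document ranked near the top by both systems scores higher than one that is not.
print(round(metasearch_score([0, 0]), 3))
print(round(metasearch_score([2, 1]), 3))
```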
|
|
|
Information access for context-aware appliances (poster session) |
| |
Gareth J. F. Jones,
Peter J. Brown
|
|
Pages: 382-384 |
|
doi>10.1145/345508.345666 |
|
Full text: PDF
|
|
The emergence of networked context-aware mobile computing appliances potentially offers opportunities for remote access to huge online information resources. Information access in context-aware information appliances can utilize existing techniques developed for effective information retrieval and information filtering; however, practical physical and operational features of these devices and the availability of context information itself suggest that the document selection process should make use of this contextual data.
|
|
|
Finding relevant passages using noun-noun compounds (poster session): coherence vs. proximity |
| |
Eduard Hoenkamp,
Rob de Groot
|
|
Pages: 385-387 |
|
doi>10.1145/345508.345667 |
|
Full text: PDF
|
|
Intuitively, words forming phrases are a more precise description of content than words as a sequence of keywords. Yet, evidence that phrases would be more effective for information retrieval is inconclusive. This paper isolates a neglected class of phrases that is abundant in communication, has an established theoretical foundation, and shows promise as an effective expression of the user's information need: the noun-noun compound (NNC). In an experiment, a variety of meaningful NNCs were used to isolate relevant passages in a large and varied corpus. In a first pass, passages were retrieved based on textual proximity of the words or their semantic peers. A second pass retained only passages containing a syntactically coherent structure equivalent to the original NNC. This second pass showed a dramatic increase in precision. Preliminary results show the validity of our intuition about phrases in the special but very productive case of NNCs.
|
|
|
Semantic Explorer — navigation in documents collections; Proxima Daily — learning personal newspaper (demonstration session) |
| |
Vadim Asadov,
Serge Shumsky
|
|
Page: 388 |
|
doi>10.1145/345508.345668 |
|
Full text: PDF
|
|
|
|
|
Integrated search tools for newspaper digital libraries (demonstration session) |
| |
S. L. Mantzaris,
B. Gatos,
N. Gouraros,
P. Tzavelis
|
|
Page: 389 |
|
doi>10.1145/345508.345670 |
|
Full text: PDF
|
|
|
|
|
Managing photos with AT&T Shoebox (demonstration session) |
| |
Timothy J. Mills,
David Pye,
David Sinclair,
Kenneth R. Wood
|
|
Page: 390 |
|
doi>10.1145/345508.345671 |
|
Full text: PDF
|
|
|
|
|
ClusterBook, a tool for dual information access (demonstration session) |
| |
Gheorghe Mureşan,
David J. Harper,
Ayşe Göker,
Peter Lowit
|
|
Page: 391 |
|
doi>10.1145/345508.345672 |
|
|
|
|
Uexküll (demonstration session): an interactive visual user interface for document retrieval in vector space |
| |
Michael Preminger,
Sandor Daranyi
|
|
Page: 392 |
|
doi>10.1145/345508.345673 |
|
Full text: PDF
|
|
|
|
|
TimeMine (demonstration session): visualizing automatically constructed timelines |
| |
Russell Swan,
James Allan
|
|
Page: 393 |
|
doi>10.1145/345508.345674 |
|
Full text: PDF
|
|
|
|
|
The Cambridge University Multimedia Document Retrieval demo system (demonstration session) |
| |
A. Tuerk,
S. E. Johnson,
P. Jourlin,
K. Spärck Jones,
P. C. Woodland
|
|
Page: 394 |
|
doi>10.1145/345508.345675 |
|
Full text: PDF
|
|
|