Improving automatic Chinese text categorization by error correction
In this paper we use the misclassified news in training data as feedback to improve the classification accuracy. We isolate the misclassified news from the news of original classes to form new subclasses, and modify the Rocchio linear classifier by ...
Web page classification based on k-nearest neighbor approach
Automatic categorization is the only viable method to deal with the scaling problem of the World Wide Web. In this paper, we propose a Web page classifier based on an adaptation of the k-Nearest Neighbor (k-NN) approach. To improve the performance of k-NN ...
Combining multiple sources for short query translation in Chinese-English cross-language information retrieval
In this paper, we examine various factors that affect the retrieval performance of Chinese-English cross-language retrieval. The factors include segmentation dictionary coverage, segmentation algorithm, transfer dictionary coverage, transfer dictionary ...
Query term disambiguation for Web cross-language information retrieval using a search engine
With the worldwide growth of the Internet, research on Cross-Language Information Retrieval (CLIR) is receiving much attention. Existing CLIR approaches based on query translation require parallel corpora or comparable corpora for the disambiguation of ...
Explorative multilingual text retrieval based on fuzzy multilingual keyword classification
This paper proposes an explorative approach to multilingual text retrieval (MLTR) based on fuzzy multilingual keyword classification. The approach applies fuzzy clustering to obtain a classification of multilingual keywords by concepts. A multilingual ...
MIETTA — a framework for uniform and multilingual access to structured database and Web information
We describe a WWW-based information system called MIETTA, which allows uniform and multilingual access to heterogeneous data sources in the tourism domain. The design of the search engine is based on a new crosslingual framework. The framework integrates ...
Hybrid term indexing for different IR models
Retrieval effectiveness depends on how terms are extracted and indexed. In Chinese text (and in other languages such as Japanese and Korean), there are no spaces to delimit words. Indexing using hybrid terms (i.e. words and bigrams) was able to achieve the best ...
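As a rough illustration of what "hybrid terms (i.e. words and bigrams)" can mean for unsegmented text, the sketch below combines words with overlapping character bigrams. It is not taken from the paper: the function names `char_bigrams` and `hybrid_terms` are hypothetical, and the words are assumed to come from some external segmenter that is not shown here.

```python
def char_bigrams(text):
    """Return the overlapping character bigrams of an unsegmented string."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def hybrid_terms(text, segmented_words):
    """Combine words with overlapping bigrams into one list of index terms.

    `segmented_words` is assumed to be the output of any word segmenter;
    bigrams that duplicate an existing term are not added again.
    """
    terms = list(segmented_words)
    for bigram in char_bigrams(text):
        if bigram not in terms:
            terms.append(bigram)
    return terms


if __name__ == "__main__":
    # "信息检索" (information retrieval), segmented by hand for the example.
    print(hybrid_terms("信息检索", ["信息", "检索"]))
```

The bigram that crosses the word boundary ("息检") is kept alongside the two words, which is the extra evidence a hybrid index can offer when the segmenter is wrong.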
PM-based indexing for Chinese text retrieval
This paper introduces a novel PM indexing schema for Chinese text retrieval. Unlike in Western languages, there is no delimiter between words in Chinese texts. The indexing is based either on the characters or on the segmented ...
An efficient accessing technique of Chinese characters using Boshiamy Chinese input system
In this paper, a new efficient technique for Chinese character retrieval is proposed. This technique designs a minimal perfect hashing function based on the Chinese remainder theorem for a simple and widely used Chinese input system called Boshiamy ...
Improvement of vector space information retrieval model based on supervised learning
This paper proposes a method to improve the retrieval performance of the vector space model (VSM) by utilizing user-supplied information about those documents that are relevant to the query in question. In addition to the user's relevance feedback ...
Character cluster based Thai information retrieval
Some languages, including Thai, Japanese and Chinese, do not have explicit word boundaries. This causes the problem of word boundary ambiguity, which reduces the accuracy of information retrieval. This paper proposes a new technique so-called ...
Japanese probabilistic information retrieval using location and category information
Robertson's 2-Poisson information retrieval model does not use location and category information. We constructed a framework using location and category information in a 2-Poisson model. We submitted two systems based on this framework to the IREX ...
Query expansion using phonetic confusions for Chinese spoken document retrieval
This paper presents a method of query expansion based on phonetic confusions for retrieving spoken documents using text queries. This method is applied to a Chinese spoken document retrieval task. A series of experiments have been carried out for ...
A first step towards flexible local feedback for ad hoc retrieval
Local feedback for ad hoc retrieval typically hurts performance for about one-third of the search requests while improving the average performance. Our objective is to make it more reliable by estimating the optimal number of assumed-relevant documents ...
Information extraction for Thai documents
An increasing amount of electronically available information is stored in Asian language documents, which makes Information Retrieval (IR) and Information Extraction (IE) for these languages important for a large number of users. Analysis and extraction ...
Korean text summarization using an aggregate similarity
In this paper, each document is represented by a weighted graph called a text relationship map. In the graph, each node represents a vector of nouns in a sentence, an undirected link connects two nodes if two sentences are semantically related, and a ...
Research on a faster algorithm for pattern matching
Based on a deep analysis of the Boyer-Moore and Quick Search algorithms, we propose a faster algorithm for single pattern matching that uses continuous skips over the text. This idea enables its high performance because of the large shift on ...
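For context, the sketch below shows the standard Quick Search algorithm (Sunday's variant of Boyer-Moore) that the abstract names as a starting point. It is the baseline only, not the faster algorithm the paper proposes, and the function name `quick_search` is chosen here for illustration.

```python
def quick_search(pattern, text):
    """Return all start positions of `pattern` in `text` (Quick Search).

    After each attempt the window moves according to the text character
    just past its right end: m minus that character's last index in the
    pattern, or m + 1 if the character does not occur in the pattern.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Later occurrences overwrite earlier ones, so the last index wins.
    shift = {ch: m - i for i, ch in enumerate(pattern)}
    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        if pos + m >= n:
            break  # no character follows the window, so no further shift
        pos += shift.get(text[pos + m], m + 1)
    return matches


if __name__ == "__main__":
    print(quick_search("aba", "abababa"))
```

Because the shift depends only on one character outside the window, it can reach m + 1 positions per step, which is the kind of large skip the abstract refers to.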
Dynamic programming: a method for taking advantage of technical terminology in Japanese documents
We introduce a new similarity measure based on dynamic programming, intended for technical terms such as "machine translation system", which are quite common in technical writing. We compare our proposal with systems which use standard IDF cosine ...
Two approaches for the resolution of word mismatch problem caused by English words and foreign words in Korean information retrieval
The use of English words in Korean text, with or without phonetic translation, has recently been growing rapidly. To make matters worse, the Korean transliterations of an English word can vary widely. The mixed use of English words and their various ...
On the use of words and n-grams for Chinese information retrieval
In the processing of Chinese documents and queries in information retrieval (IR), one has to identify the units that are used as indexes. Words and n-grams have been used as indexes in several previous studies, which showed that both kinds of indexes ...
Content-based language models for spoken document retrieval
Spoken document retrieval (SDR) has been extensively studied in recent years because of its potential use in navigating large multimedia collections in the near future. This paper presents a novel concept of applying content-based language models to ...
Structural analysis of cooking preparation steps in Japanese
We propose a method to create process flow graphs automatically from textbooks for cooking programs. This is realized by understanding context by narrowing down the domain to cooking, and making use of domain specific constraints and knowledge. Since it ...
Topic detection and tracking in English and Chinese
Topic Detection and Tracking (TDT) refers to automatic techniques for discovering, threading, and retrieving topically related material in streams of data. Newswire and broadcast news are the canonical sources. In 1999, TDT research was extended from ...
Exploiting a Chinese-English bilingual wordlist for English-Chinese cross language information retrieval
We investigated using the LDC English/Chinese bilingual wordlists for English-Chinese cross language retrieval. It is shown that the Chinese-to-English wordlist can be considered as both a phrase and word dictionary, and is preferable to the English-to-...
MT-based Japanese-English cross-language IR experiments using the TREC test collections
This paper evaluates the effectiveness of MT-based Japanese-English CLIR using a subcollection of the TREC test collections and two bilingual researchers to separately translate the TREC requests into Japanese. Our main findings are as follows: (1) With ...
Construction of a Chinese-English WordNet and its application to CLIR
This paper integrates five linguistic resources, including Cilin, a Chinese-English dictionary, ASBC corpus, SemCor, and WordNet, to construct a Chinese-English WordNet. The result is employed in Chinese-English information retrieval. Under TREC-6 text ...
Comparison of word-based and syllable-based retrieval for Tibetan (poster session)
Tibetan retrieval based on automatically segmented words is compared with the use of overlapping syllable n-grams using a known-item retrieval evaluation. The optimal span of fixed-length n-grams is found to be 2 syllables, and indexing words is found ...
Effect of dependency relationships and ordered co-occurrence of words on Japanese information retrieval (poster session)
We propose two Japanese information retrieval methods that enhance retrieval effectiveness using relationships between words. One is a method using dependency relationships between words in a sentence, and the other is a method using the ordered co-...
Automatic text summarization based on relevance feedback with query splitting (poster session)
This paper describes a method of text summarization using a query expansion technique. Generally, summarization systems using query expansion have the problem that the feedback query becomes biased during the query expansion process. We can alleviate this ...
Automatic recommendation of hot topics in discussion-type newsgroups (poster session)
We are developing an intelligent network news reader intended to help people use discussion-type network news more effectively. It is called HISHO and will assist users to find whole threads in Japanese discussions that are relevant to the users' ...
Index Terms
Proceedings of the fifth international workshop on Information retrieval with Asian languages




