
ABSTRACT

This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that bear on different topics. The method combines the use of the Probabilistic Latent Semantic Analysis (PLSA) model with the method of selecting segmentation points based on the similarity values between pairs of adjacent blocks. The use of PLSA allows for a better representation of sparse information in a text block, such as a sentence or a sequence of sentences. Furthermore, segmentation performance is improved by combining different instantiations of the same model, either using different random initializations or different numbers of latent classes. Results on commonly available data sets are significantly better than those of other state-of-the-art systems.
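As a rough illustration of the block-similarity step described in the abstract, the following Python sketch scores the gaps between adjacent blocks and picks boundaries at pronounced similarity dips. It assumes each text block has already been mapped to a vector, for example a distribution over latent classes from a PLSA model (that step is not shown). The function names, the cosine measure, the dip-depth criterion, the `min_depth` threshold, and the averaging of similarity curves across several models are illustrative assumptions; the abstract does not specify these details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adjacent_similarities(blocks):
    """Similarity between each pair of neighbouring blocks (one value per gap)."""
    return [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]


def average_curves(curves):
    """Assumed way to combine several models: average their similarity curves."""
    return [sum(vals) / len(vals) for vals in zip(*curves)]


def select_boundaries(sims, min_depth=0.3):
    """Return indices of blocks that start a new segment.

    A gap is selected when it is a local minimum of the similarity curve
    and its depth (drop from the nearest peaks on both sides) reaches
    min_depth. Both criteria are assumptions made for this sketch.
    """
    boundaries = []
    for g, s in enumerate(sims):
        left_nb = sims[g - 1] if g > 0 else float("inf")
        right_nb = sims[g + 1] if g + 1 < len(sims) else float("inf")
        if s > left_nb or s > right_nb:
            continue  # not a local minimum
        # Climb to the nearest peak on the left.
        left = s
        for x in reversed(sims[:g]):
            if x < left:
                break
            left = x
        # Climb to the nearest peak on the right.
        right = s
        for x in sims[g + 1:]:
            if x < right:
                break
            right = x
        depth = (left - s) + (right - s)
        if depth >= min_depth:
            boundaries.append(g + 1)  # segment boundary before block g + 1
    return boundaries


if __name__ == "__main__":
    # Hypothetical latent-class distributions for four blocks.
    blocks = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
    print(select_boundaries(adjacent_similarities(blocks)))
```

With the hypothetical vectors above, the only deep dip lies between the second and third blocks, so a single boundary is placed there. To combine several model instantiations under this sketch's assumption, compute one similarity curve per model and pass the result of `average_curves` to `select_boundaries`.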



AUTHORS



Thorsten Brants

Bibliometrics: publication history
Publication years: 1995-2009
Publication count: 15
Citation Count: 598
Available for download: 14
Downloads (6 Weeks): 55
Downloads (12 Months): 591
Downloads (cumulative): 8,194
Average downloads per article: 585.29
Average citations per article: 39.87


Francine Chen

Bibliometrics: publication history
Publication years: 1989-2016
Publication count: 36
Citation Count: 618
Available for download: 29
Downloads (6 Weeks): 92
Downloads (12 Months): 921
Downloads (cumulative): 13,586
Average downloads per article: 468.48
Average citations per article: 17.17


Ioannis Tsochantaridis

Bibliometrics: publication history
Publication years: 2002-2005
Publication count: 7
Citation Count: 692
Available for download: 3
Downloads (6 Weeks): 25
Downloads (12 Months): 330
Downloads (cumulative): 5,301
Average downloads per article: 1,767.00
Average citations per article: 98.86


CITED BY

38 Citations


PUBLICATION

Title: CIKM '02 Proceedings of the eleventh international conference on Information and knowledge management
General Chairs: Charles Nicholas, University of Maryland Baltimore County
Program Chairs: David Grossman, Illinois Institute of Technology
    Konstantinos Kalpakis, University of Maryland Baltimore County
    Sajda Qureshi, Erasmus University, Rotterdam
    Han van Dissel, Erasmus University, Rotterdam
    Len Seligman, The MITRE Corporation
Pages: 211-218
Publication Date: 2002-11-04
Sponsors: SIGIR, ACM Special Interest Group on Information Retrieval
    SIGMIS, ACM Special Interest Group on Management Information Systems
    ACM, Association for Computing Machinery
Publisher: ACM, New York, NY, USA ©2002
ISBN: 1-58113-492-4
Order Number: 605020
doi>10.1145/584792.584829
Conference: CIKM, Conference on Information and Knowledge Management
Overall Acceptance Rate: 1,482 of 8,376 submissions, 18%

Year       Submitted  Accepted  Rate
CIKM '05   425        77        18%
CIKM '06   537        81        15%
CIKM '07   512        86        17%
CIKM '08   772        132       17%
CIKM '09   847        123       15%
CIKM '10   945        126       13%
CIKM '11   918        228       25%
CIKM '12   1,088      146       13%
CIKM '13   848        143       17%
CIKM '14   838        175       21%
CIKM '15   646        165       26%
Overall    8,376      1,482     18%

APPEARS IN
Artificial Intelligence
Digital Content
Operations and Management


Table of Contents

Proceedings of the eleventh international conference on Information and knowledge management
On scalable information retrieval systems
Ophir Frieder
Pages: 1-1
doi>10.1145/584792.584793

Implementing scalable information retrieval systems requires the design and development of efficient methods to ingest data from multiple sources, search and retrieve results from both English and foreign language document collections and from collections ...
SESSION: Pattern discovery and forecasting
F4: large-scale automated forecasting using fractals
Deepayan Chakrabarti, Christos Faloutsos
Pages: 2-9
doi>10.1145/584792.584797

Forecasting has attracted a lot of research interest, with very successful methods for periodic time series. Here, we propose a fast, automated method to do non-linear forecasting, for both periodic as well as chaotic time series. We use the technique ...
An iterative strategy for pattern discovery in high-dimensional data sets
Chun Tang, Aidong Zhang
Pages: 10-17
doi>10.1145/584792.584798

High-dimensional data representation in which each data item (termed target object) is described by many features, is a necessary component of many applications. For example, in DNA microarrays, each sample (target object) is represented by thousands ...
Mining sequential patterns with constraints in large databases
Jian Pei, Jiawei Han, Wei Wang
Pages: 18-25
doi>10.1145/584792.584799

Constraints are essential for many sequential pattern mining applications. However, there is no systematic study on constraint-based sequential pattern mining. In this paper, we investigate this issue and point out that the framework developed ...
SESSION: Web search 1
Searching web databases by structuring keyword-based queries
Pável Calado, Altigran S. da Silva, Rodrigo C. Vieira, Alberto H. F. Laender, Berthier A. Ribeiro-Neto
Pages: 26-33
doi>10.1145/584792.584801

On-line information services have become widespread in the Web nowadays. However, Web users are non-specialized and have a great variety of interests. Thus, interfaces for Web databases must be simple and uniform. In this paper we present an approach, ...
Topic-oriented collaborative crawling
Chiasen Chung, Charles L. A. Clarke
Pages: 34-42
doi>10.1145/584792.584802

A major concern in the implementation of a distributed Web crawler is the choice of a strategy for partitioning the Web among the nodes in the system. Our goal in selecting this strategy is to minimize the overlap between the activities of individual ...
Meta-recommendation systems: user-controlled integration of diverse recommendations
J. Ben Schafer, Joseph A. Konstan, John Riedl
Pages: 43-51
doi>10.1145/584792.584803

In a world where the number of choices can be overwhelming, recommender systems help users find and evaluate items of interest. They do so by connecting users with information regarding the content of recommended items or the opinions of other individuals. ...
Removing redundancy and inconsistency in memory-based collaborative filtering
Kai Yu, Xiaowei Xu, Anton Schwaighofer, Volker Tresp, Hans-Peter Kriegel
Pages: 52-59
doi>10.1145/584792.584804

The application range of memory-based collaborative filtering (CF) is limited due to CF's high memory consumption and long runtime. The approach presented in this paper removes redundant and inconsistent instances (users) from the data. This paper aims ...
SESSION: Data warehousing and OLAP
Analysis of pre-computed partition top method for range top-k queries in OLAP data cubes
Zheng Xuan Loh, Tok Wang Ling, Chuan Heng Ang, Sin Yeung Lee
Pages: 60-67
doi>10.1145/584792.584806

In decision support systems, having knowledge on the top k values is more informative and crucial than the maximum value. Unfortunately, the naive method involves high computational cost and the existing methods for range-max query are inefficient ...
Batch data warehouse maintenance in dynamic environments
Bin Liu, Songting Chen, Elke A. Rundensteiner
Pages: 68-75
doi>10.1145/584792.584807

Data warehouse view maintenance is an important issue due to the growing use of warehouse technology for information integration and data analysis. Given the dynamic nature of modern distributed environments, both data updates and schema changes are ...
A fast filtering scheme for large database cleansing
Sam Y. Sung, Zhao Li, Peng Sun
Pages: 76-83
doi>10.1145/584792.584808

Existing data cleansing methods are costly and take a very long time to cleanse large databases. Since large databases are common nowadays, it is necessary to reduce the cleansing time. Data cleansing consists of two main components, detection method ...
Semantic-based delivery of OLAP summary tables in wireless environments
Mohamed A. Sharaf, Panos K. Chrysanthis
Pages: 84-92
doi>10.1145/584792.584809

With the rapid growth in mobile and wireless technologies and the availability, pervasiveness and cost effectiveness of wireless networks, mobile computers are quickly becoming the normal front-end devices for accessing enterprise data. In this paper, ...
Future directions in data mining: streams, networks, self-similarity and power laws
Christos Faloutsos
Pages: 93-93
doi>10.1145/584792.584794

How to spot abnormalities in a stream of temperature data from a sensor? Or from a network of sensors? How does the Internet look like? Are there 'abnormal' sub-graphs in a given social network, possibly indicating, e.g., money-laundering rings? We present ...
SESSION: Image similarity search systems
Symbolic photograph content-based retrieval
Philippe Mulhem, Joo Hwee Lim
Pages: 94-101
doi>10.1145/584792.584811

Photograph retrieval systems face the difficulty to deal with the different ways to apprehend the content of images. We consider and demonstrate here the use of multiple index representations of photographs to achieve effective retrieval. The use of ...
A compact and efficient image retrieval approach based on border/interior pixel classification
Renato O. Stehling, Mario A. Nascimento, Alexandre X. Falcão
Pages: 102-109
doi>10.1145/584792.584812

This paper presents BIC (Border/Interior pixel Classification), a compact and efficient CBIR approach suitable for broad image domains. It has three main components: (1) a simple and powerful image analysis algorithm that classifies ...
Vulnerabilities in similarity search based systems
Ali Saman Tosun, Hakan Ferhatosmanoglu
Pages: 110-117
doi>10.1145/584792.584813

Similarity based queries are common in several modern database applications, such as multimedia, scientific, and biomedical databases. In most of these systems, database responds with the tuple with the closest match according to some metric. In this ...
SESSION: XML query processing
Efficient evaluation of multiple queries on streaming XML data
Mong Li Lee, Boon Chin Chua, Wynne Hsu, Kian-Lee Tan
Pages: 118-125
doi>10.1145/584792.584815

Traditionally, XML documents are processed where they are stored. This allows the query processor to exploit pre-computed data structures (e.g., index) to retrieve the desired data efficiently. However, this mode of processing is not suitable for ...
Query processing of streamed XML data
Leonidas Fegaras, David Levine, Sujoe Bose, Vamsi Chaluvadi
Pages: 126-133
doi>10.1145/584792.584816

We are addressing the efficient processing of continuous XML streams, in which the server broadcasts XML data to multiple clients concurrently through a multicast data stream, while each client is fully responsible for processing the stream. In our framework, ...
Multi-level operator combination in XML query processing
Shurug Al-Khalifa, H. V. Jagadish
Pages: 134-141
doi>10.1145/584792.584817

A core set of efficient access methods is central to the development of any database system. In the context of an XML database, there has been considerable effort devoted to defining a good set of primitive operators and inventing efficient access methods ...
SESSION: XML transactions
XMLTM: efficient transaction management for XML documents
Torsten Grabs, Klemens Böhm, Hans-Jörg Schek
Pages: 142-152
doi>10.1145/584792.584819

A common approach to storage and retrieval of XML documents is to store them in a database, together with materialized views on their content. The advantage over "native" XML storage managers seems to be that transactions and concurrency are for free, ...
Efficient synchronization for mobile XML data
Franky Lam, Nicole Lam, Raymond Wong
Pages: 153-160
doi>10.1145/584792.584820

Many handheld applications receive data from a primary database server and operate in an intermittently connected environment these days. They maintain data consistency with data sources through synchronization. In certain applications such as sales force ...
An object-oriented extension of XML for autonomous web applications
Hasan M. Jamil, Giovanni A. Modica
Pages: 161-168
doi>10.1145/584792.584821

While the idea of extending XML to include object-oriented features has been gaining popularity in general, the potential of inheritance in document design has not been well recognized in contemporary research. In this paper we demonstrate that XML with ...
SESSION: Caching
Efficient prediction of web accesses on a proxy server
Wenwu Lou, Hongjun Lu
Pages: 169-176
doi>10.1145/584792.584823

Web access prediction is an active research topic with many applications. Various approaches have been proposed for Web access prediction in the domain of individual Web servers but they have to be tailored to the domain of proxy servers to satisfy its ...
A self-managing data cache for edge-of-network web applications
Khalil Amiri, Sanghyun Park, Renu Tewari
Pages: 177-185
doi>10.1145/584792.584824

Database caching at proxy servers enables dynamic content to be generated at the edge of the network, thereby improving the scalability and response time of web applications. The scale of deployment of edge servers coupled with the rising costs of their ...
Cooperative caching by mobile clients in push-based information systems
Takahiro Hara
Pages: 186-193
doi>10.1145/584792.584825

Recent advances in computer and wireless communication technologies have increased interest in push-based information systems in which a server repeatedly broadcasts data to clients through a broadband channel. In this paper, assuming an environment ...
SESSION: Information extraction and text segmentation
AuGEAS: authoritativeness grading, estimation, and sorting
Ayman Farahat, Geoff Nunberg, Francine Chen
Pages: 194-202
doi>10.1145/584792.584827

When searching for content in large heterogeneous document collections like the World Wide Web, it is not easy to know which documents provide reliable authoritative information about a subject. The problem is particularly pointed as it concerns ...
Structural extraction from visual layout of documents
Binyamin Rosenfeld, Ronen Feldman, Yonatan Aumann
Pages: 203-210
doi>10.1145/584792.584828

Most information extraction systems focus on the textual content of the documents. They treat documents as sequences of words, disregarding the physical and typographical layout of the information. While this strategy helps in focusing the extraction ...
Topic-based document segmentation with probabilistic latent semantic analysis
Thorsten Brants, Francine Chen, Ioannis Tsochantaridis
Pages: 211-218
doi>10.1145/584792.584829

This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that bear on different topics. The method combines the use of the Probabilistic Latent Semantic Analysis (PLSA) ...
SESSION: Sequence similarity search and access methods
How to improve the pruning ability of dynamic metric access methods
Caetano Traina, Jr., Agma Traina, Roberto Santos Filho, Christos Faloutsos
Pages: 219-226
doi>10.1145/584792.584831

Complex data retrieval is accelerated using index structures, which organize the data in order to prune comparisons between data during queries. In metric spaces, comparison operations can be especially expensive, so the pruning ability of indexing methods ...
On the efficient evaluation of relaxed queries in biological databases
Yangjun Chen, Duren Che, Karl Aberer
Pages: 227-236
doi>10.1145/584792.584832

In this paper, a new technique is developed to support the query relaxation in biological databases. Query relaxation is required due to the fact that queries tend not to be expressed exactly by the users, especially in scientific databases such as biological ...
Similarity based retrieval from sequence databases using automata as queries
A. Prasad Sistla, Tao Hu, Vikas Chowdhry
Pages: 237-244
doi>10.1145/584792.584833

Similarity based retrieval from sequence databases is of importance in many applications such as time-series, video and textual databases. In this paper, automata based formalisms are introduced for specifying queries over such databases. Various measures ...
SESSION: Information retrieval models
Detecting similar documents using salient terms
James W. Cooper, Anni R. Coden, Eric W. Brown
Pages: 245-251
doi>10.1145/584792.584835

We describe a system for rapidly determining document similarity among a set of documents obtained from an information retrieval (IR) system. We obtain a ranked list of the most important terms in each document using a rapid phrase recognizer system. ...
The role of variance in term weighting for probabilistic information retrieval
Warren R. Greiff, William T. Morgan, Jay M. Ponte
Pages: 252-259
doi>10.1145/584792.584836

In probabilistic approaches to information retrieval, the occurrence of a query term in a document contributes to the probability that the document will be judged relevant. It is typically assumed that the weight assigned to a query term should be based ...
Inferring query models by computing information flow
P. D. Bruza, D. Song
Pages: 260-269
doi>10.1145/584792.584837

The language modelling approach to information retrieval can also be used to compute query models. A query model can be envisaged as an expansion of an initial query. The more prominent query models in the literature have a probabilistic basis. This ...
SESSION: XML schemas: integration and translation
Logical and physical support for heterogeneous data
Sihem Amer-Yahia, Mary Fernández, Rick Greer, Divesh Srivastava
Pages: 270-281
doi>10.1145/584792.584839

Heterogeneity arises naturally in virtually all real-world data. This paper presents evolutionary extensions to a relational database system for supporting three classes of data heterogeneity: variational, structural and annotational heterogeneities. ...
NeT & CoT: translating relational schemas to XML schemas using semantic constraints
Dongwon Lee, Murali Mani, Frank Chiu, Wesley W. Chu
Pages: 282-291
doi>10.1145/584792.584840

Two algorithms, called NeT and CoT, to translate relational schemas to XML schemas using various semantic constraints are presented. The XML schema representation we use is a language-independent formalism named XSchema, that is both precise and concise. ...
XClust: clustering XML schemas for effective integration
Mong Li Lee, Liang Huai Yang, Wynne Hsu, Xia Yang
Pages: 292-299
doi>10.1145/584792.584841

It is increasingly important to develop scalable integration techniques for the growing number of XML data sources. A practical starting point for the integration of large numbers of Document Type Definitions (DTDs) of XML sources would be to first find ...
A local search mechanism for peer-to-peer networks
Vana Kalogeraki, Dimitrios Gunopulos, D. Zeinalipour-Yazti
Pages: 300-307
doi>10.1145/584792.584842

One important problem in peer-to-peer (P2P) networks is searching and retrieving the correct information. However, existing searching mechanisms in pure peer-to-peer networks are inefficient due to the decentralized nature of such networks. We propose ...
Intelligent knowledge discovery in peer-to-peer file sharing
Yugyung Lee, Changgyu Oh, Eun Kyo Park
Pages: 308-315
doi>10.1145/584792.584843

Emerging peer-to-peer computing provides new possibilities but also challenges for distributed applications. Despite their significant potential, current peer-to-peer networks lack efficient knowledge discovery and management. This paper addresses this ...
Partial rollback in object-oriented/object-relational database management systems
Won-Young Kim, Kyu-Young Whang, Byung Suk Lee, Young-Koo Lee, Ji-Woong Chang
Pages: 316-323
doi>10.1145/584792.584844

In a database management system (DBMS), partial rollback is an important mechanism for canceling only part of the operations executed in a transaction back to a savepoint. Partial rollback complicates buffer management because it should restore the state ...
SESSION: Information retrieval 1
Query association for effective retrieval
Falk Scholer, Hugh E. Williams
Pages: 324-331
doi>10.1145/584792.584846

We introduce a novel technique for document summarisation which we call query association. Query association is based on the notion that a query that is highly similar to a document is a good descriptor of that document. For example, the user query "richmond ...
Pruning long documents for distributed information retrieval
Jie Lu, Jamie Callan
Pages: 332-339
doi>10.1145/584792.584847

Query-based sampling is a method of discovering the contents of a text database by submitting queries to a search engine and observing the documents returned. In prior research sampled documents were used to build resource descriptions for automatic ...
On arabic search: improving the retrieval effectiveness via a light stemming approach
Mohammed Aljlayl, Ophir Frieder
Pages: 340-347
doi>10.1145/584792.584848

The inflectional structure of a word impacts the retrieval accuracy of information retrieval systems of Latin-based languages. We present two stemming algorithms for Arabic information retrieval systems. We empirically investigate the effectiveness of ...
SESSION: Classification
Boosting to correct inductive bias in text classification
Yan Liu, Yiming Yang, Jaime Carbonell
Pages: 348-355
doi>10.1145/584792.584850

This paper studies the effects of boosting in the context of different classification methods for text categorization, including Decision Trees, Naive Bayes, Support Vector Machines (SVMs) and a Rocchio-style classifier. We identify the inductive biases ...
Using conjunction of attribute values for classification
Mukund Deshpande, George Karypis
Pages: 356-364
doi>10.1145/584792.584851

Advances in the efficient discovery of frequent itemsets have led to the development of a number of schemes that use frequent itemsets to aid developing accurate and efficient classifiers. These approaches use the frequent itemsets to generate a set ...
Categorizing information objects from user access patterns
Mao Chen, Andrea LaPaugh, Jaswinder Pal Singh
Pages: 365-372
doi>10.1145/584792.584852

Many web sites have dynamic information objects whose topics change over time. Classifying these objects automatically and promptly is a challenging and important problem for site masters. Traditional content-based and link structure based ...
Knowledge and information management: is it possible to do interesting and important research, get funded, be useful and appreciated?
Maria Zemankova
Pages: 373-374
doi>10.1145/584792.584795

The survey of the CIKM Call for Papers for the period 1998 - 2002 demonstrates that the CIKM organizers very accurately "identify challenging problems facing the development of future knowledge and information systems [in] applied and theoretical research" ...
SESSION: Language models for information retrieval
Passage retrieval based on language models
Xiaoyong Liu, W. Bruce Croft
Pages: 375-382
doi>10.1145/584792.584854

Previous research has shown that passage-level evidence can bring added benefits to document retrieval when documents are long or span different subject areas. Recent developments in language modeling approach to IR provided a new effective alternative ...
Capturing term dependencies using a language model based on sentence trees
Ramesh Nallapati, James Allan
Pages: 383-390
doi>10.1145/584792.584855

We describe a new probabilistic Sentence Tree Language Modeling approach that captures term dependency patterns in Topic Detection and Tracking's (TDT) Story Link Detection task. New features of the approach include modeling the syntactic structure of ...
A language modeling framework for resource selection and results merging
Luo Si, Rong Jin, Jamie Callan, Paul Ogilvie
Pages: 391-397
doi>10.1145/584792.584856

Statistical language models have been proposed recently for several information retrieval tasks, including the resource selection task in distributed information retrieval. This paper extends the language modeling approach to integrate resource selection, ...
SESSION: Spatial search and moving objects
An efficient and effective algorithm for density biased sampling
Alexandros Nanopoulos, Yannis Manolopoulos, Yannis Theodoridis
Pages: 398-404
doi>10.1145/584792.584858

In this paper we describe a new density-biased sampling algorithm. It exploits spatial indexes and the local density information they preserve, to provide improved quality of sampling result and fast access to elements of the dataset. It attains improved ...
"GeoPlot": spatial data mining on video libraries
Jia-Yu Pan, Christos Faloutsos
Pages: 405-412
doi>10.1145/584792.584859

Are "tornado" touchdowns related to "earthquakes"? How about to "floods", or to "hurricanes"? In Informedia [14], using a gazetteer on news video clips, we map news onto points on the globe and find correlations between sets of points. In this paper ...
Trajectory queries and octagons in moving object databases
Hongjun Zhu, Jianwen Su, Oscar H. Ibarra
Pages: 413-421
doi>10.1145/584792.584860

An important class of queries in moving object databases involves trajectories. We propose to divide trajectory predicates into topological and non-topological parts; extend the 9 intersection model of Egenhofer-Franzosa to a 3-step evaluation strategy ...
SESSION: Music information retrieval
The effectiveness study of various music information retrieval approaches
Jia-Lien Hsu, Arbee L. P. Chen, Hung-Chen Chen, Ning-Han Liu
Pages: 422-429
doi>10.1145/584792.584862

In this paper, we describe the Ultima project which aims to construct a platform for evaluating various approaches of music information retrieval. Two kinds of approaches are adopted in this project. These approaches differ in various aspects, such as ...
Harmonic models for polyphonic music retrieval
Jeremy Pickens, Tim Crawford
Pages: 430-437
doi>10.1145/584792.584863

Most work in the ad hoc music retrieval field has focused on the retrieval of monophonic documents using monophonic queries. Polyphony adds considerably more complexity. We present a method by which polyphonic music documents may be retrieved by polyphonic ...
A singer identification technique for content-based classification of MP3 music objects
Chih-Chin Liu, Chuan-Sung Huang
Pages: 438-445
doi>10.1145/584792.584864

As there is a growing amount of MP3 music data available on the Internet today, the problems related to music classification and content-based music retrieval are getting more attention recently. In this paper, we propose an approach to automatically ...
SESSION: XML constraints and the semantic web
XKvalidator: a constraint validator for XML
Yi Chen, Susan B. Davidson, Yifeng Zheng
Pages: 446-452
doi>10.1145/584792.584866

The role of XML in data exchange is evolving from one of merely conveying the structure of data to one that also conveys its semantics. In particular, several proposals for key and foreign key constraints have recently appeared, and aspects of these ...
Discovering approximate keys in XML data
Gösta Grahne, Jianfei Zhu
Pages: 453-460
doi>10.1145/584792.584867

Keys are very important in many aspects of data management, such as guiding query formulation, query optimization, indexing, etc. We consider the situation where an XML document does not come with key definitions, and we are interested in using data ...
Information retrieval on the semantic web
Urvi Shah, Tim Finin, Anupam Joshi, R. Scott Cost, James Matfield
Pages: 461-468
doi>10.1145/584792.584868

We describe an approach to retrieval of documents that contain both free text and semantically enriched markup. In particular, we present the design and implementation prototype of a framework in which both documents and queries can be marked up with ...
SESSION: Data streams and time-series
RHist: adaptive summarization over continuous data streams
Lin Qiao, Divyakant Agrawal, Amr El Abbadi
Pages: 469-476
doi>10.1145/584792.584870

Maintaining approximate aggregates and summaries over data streams is crucial to handle the OLAP query workload that arises in applications, such as network monitoring and telecommunications. Furthermore, since the entire data is not available at all ...
Efficient query monitoring using adaptive multiple key hashing
Kun-Lung Wu, Philip S. Yu
Pages: 477-484
doi>10.1145/584792.584871

Monitoring continual queries or subscriptions is to determine the subset of all queries or subscriptions whose predicates match a given event. Predicates contain not only equality but also non-equality clauses. Event matching is usually accomplished ...
Evaluating continuous nearest neighbor queries for streaming time series via pre-fetching
Like Gao, Zhengrong Yao, X. Sean Wang
Pages: 485-492
doi>10.1145/584792.584872

For many applications, it is important to quickly locate the nearest neighbor of a given time series. When the given time series is a streaming one, nearest neighbors may need to be found continuously at all time positions. Such a standing request is ...
Mining temporal classes from time series data
Masahiro Motoyoshi, Takao Miura, Kohei Watanabe
Pages: 493-498
doi>10.1145/584792.584873

In this investigation, we discuss how to mine Temporal Class Schemes to model a collection of time series data. From the viewpoint of temporal data mining, this problem can be seen as discretizing time series data or aggregating them. Also ...
SESSION: Web clustering
Evaluating contents-link coupled web page clustering for web search results
Yitong Wang, Masaru Kitsuregawa
Pages: 499-506
doi>10.1145/584792.584875

Clustering is currently one of the most crucial techniques for dealing with the massive amount of heterogeneous information on the web (e.g., locating resources, interpreting information). Unlike clustering in other fields, web page clustering separates unrelated ...
Inferring hierarchical descriptions
Eric Glover, David M. Pennock, Steve Lawrence, Robert Krovetz
Pages: 507-514
doi>10.1145/584792.584876

We create a statistical model for inferring hierarchical term relationships about a topic, given only a small set of example web pages on the topic, without prior knowledge of any hierarchical information. The model can utilize either the full text of ...
Evaluation of hierarchical clustering algorithms for document datasets
Ying Zhao, George Karypis
Pages: 515-524
doi>10.1145/584792.584877

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering ...
Strategies for minimising errors in hierarchical web categorisation
Wahyu Wibowo, Hugh E. Williams
Pages: 525-531
doi>10.1145/584792.584878

On the Web, browsing and searching categories is a popular method of finding documents. Two well-known category-based search systems are the Yahoo! and DMOZ hierarchies, which are maintained by experts who assign documents to categories. However, manual ...
SESSION: Information retrieval
Knowledge-based extraction of named entities
Jamie Callan, Teruko Mitamura
Pages: 532-537
doi>10.1145/584792.584880

The usual approach to named-entity detection is to learn extraction rules that rely on linguistic, syntactic, or document format patterns that are consistent across a set of documents. However, when there is no consistency among documents, it may be ...
Condorcet fusion for improved retrieval
Mark Montague, Javed A. Aslam
Pages: 538-548
doi>10.1145/584792.584881

We present a new algorithm for improving retrieval results by combining document ranking functions: Condorcet-fuse. Beginning with one of the two major classes of voting procedures from Social Choice Theory, the Condorcet procedure, we apply a ...
I/O-efficient techniques for computing pagerank
Yen-Yu Chen, Qingqing Gan, Torsten Suel
Pages: 549-557
doi>10.1145/584792.584882

Over the last few years, most major search engines have integrated link-based ranking techniques in order to provide more accurate search results. One widely known approach is the Pagerank technique, which forms the basis of the Google ranking scheme, ...
SESSION: Web search 2
Personalized web search by mapping user queries to categories
Fang Liu, Clement Yu, Weiyi Meng
Pages: 558-565
doi>10.1145/584792.584884

Current web search engines are built to serve all users, independent of the needs of any individual user. Personalization of web search is to carry out retrieval for each user incorporating his/her interests. We propose a novel technique to map a user ...
Using micro information units for internet search
Xiaoli Li, Tong-Heng Phang, Minqing Hu, Bing Liu
Pages: 566-573
doi>10.1145/584792.584885

Internet search is one of the most important applications of the Web. A search engine takes the user's keywords to retrieve and to rank those pages that contain the keywords. One shortcoming of existing search techniques is that they do not give due ...
Entropy-based link analysis for mining web informative structures
Hung-Yu Kao, Ming-Syan Chen, Shian-Hua Lin, Jan-Ming Ho
Pages: 574-581
doi>10.1145/584792.584886

In this paper, we study the problem of mining the informative structure of a news Web site which consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (or referred to as TOC, i.e., ...
SESSION: Clustering algorithms
COOLCAT: an entropy-based algorithm for categorical clustering
Daniel Barbará, Yi Li, Julia Couto
Pages: 582-589
doi>10.1145/584792.584888

In this paper we explore the connection between clustering categorical data and entropy: clusters of similar points have lower entropy than those of dissimilar ones. We use this connection to design an incremental heuristic algorithm, COOLCAT, which is capable ...
FREM: fast and robust EM clustering for large data sets
Carlos Ordonez, Edward Omiecinski
Pages: 590-599
doi>10.1145/584792.584889

Clustering is a fundamental Data Mining technique. This article presents an improved EM algorithm to cluster large data sets having high dimensionality, noise and zero variance problems. The algorithm incorporates improvements to increase the quality ...
Alternatives to the k-means algorithm that find better clusterings
Greg Hamerly, Charles Elkan
Pages: 600-607
doi>10.1145/584792.584890

We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k-harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic ...
SESSION: Industry session 1: knowledge management and semantics
Thematic mapping - from unstructured documents to taxonomies
Christina Yip Chung, Raymond Lieu, Jinhui Liu, Alpha Luk, Jianchang Mao, Prabhakar Raghavan
Pages: 608-610
doi>10.1145/584792.584892

Verity Inc. has developed a comprehensive suite of tools for accurately and efficiently organizing enterprise content which involves four basic steps: (i) creating taxonomies, (ii) building classification models, (iii) populating taxonomies with documents, ...
Semantic technology applications for homeland security
D. Avant, M. Baum, C. Bertram, M. Fisher, A. Sheth, Y. Warke
Pages: 611-613
doi>10.1145/584792.584893
Rule-based data quality
David Loshin
Pages: 614-616
doi>10.1145/584792.584894

In the business intelligence/data warehouse user community, there is a growing confusion as to the difference between data cleansing and data quality. While many data cleansing products can help in applying data edits to name and address ...
SESSION: Industry session 2: data mining and federated systems
Comparison of interestingness functions for learning web usage patterns
Xiangji Huang, Nick Cercone, Aijun An
Pages: 617-620
doi>10.1145/584792.584896

Livelink is a collaborative intranet, extranet and e-business application that enables employees and business partners of an organization to capture, share and reuse business information and knowledge. The usage of the Livelink software has been recorded ...
The verity federated infrastructure
Kiam Choo, Rajat Mukherjee, Rami Smair, Wei Zhang
Pages: 621-621
doi>10.1145/584792.584897

In the course of researching a subject, it is often necessary to submit the same search request to multiple heterogeneous information sources in order to (a) aggregate as much information as possible, and (b) integrate different aspects of the subject ...
Automatically classifying database workloads
Said Elnaffar, Pat Martin, Randy Horman
Pages: 622-624
doi>10.1145/584792.584898

The type of the workload on a database management system (DBMS) is a key consideration in tuning the system. Allocations for resources such as main memory can be very different depending on whether the workload type is Online Transaction Processing (OLTP) ...
SESSION: Industry session 3: database performance and interface
A mapping mechanism to support bitmap index and other auxiliary structures on tables stored as primary B+-trees
Eugene Inseok Chong, Jagannathan Srinivasan, Souripriya Das, Chuck Freiwald, Aravind Yalamanchi, Mahesh Jagannath, Anh-Tuan Tran, Ramkumar Krishnan, Richard Jiang
Pages: 625-628
doi>10.1145/584792.584900

Any auxiliary structure, such as a bitmap or a B+-tree index, that refers to rows of a table stored as a primary B+-tree (e.g., tables with clustered index in Microsoft SQL Server, or index-organized tables in Oracle) ...
Using specification-driven concepts for distributed data management and dissemination
M. Brian Blake
Pages: 629-631
doi>10.1145/584792.584901

At the MITRE Corporation-Center for Advanced Aviation System Development (CAASD), software engineers work closely with both analyst and domain experts to develop software simulations in the air traffic management domain. In this environment, software ...
SESSION: Poster session
A new cache replacement algorithm for the integration of web caching and prefetching
Cheng-Yue Chang, Ming-Syan Chen
Pages: 632-634
doi>10.1145/584792.584903

Web caching and Web prefetching are two important techniques to reduce the noticeable response time perceived by users. Note that by integrating Web caching and Web prefetching, these two techniques can complement each other since Web caching technique ...
A syntactic approach for searching similarities within sentences
Federica Mandreoli, Riccardo Martoglia, Paolo Tiberio
Pages: 635-637
doi>10.1145/584792.584904

Textual data is the main electronic form of knowledge representation. Sentences, meant as logic units of meaningful word sequences, can be considered its backbone. In this paper, we propose a solution based on a purely syntactic approach for searching ...
A system for knowledge management in bioinformatics
Sudeshna Adak, Vishal S. Batra, Deo N. Bhardwaj, P. V. Kamesam, Pankaj Kankar, Manish P. Kurhekar, Biplav Srivastava
Pages: 638-641
doi>10.1145/584792.584905

The emerging biochip technology has made it possible to simultaneously study expression (activity level) of thousands of genes or proteins in a single experiment in the laboratory. However, in order to extract relevant biological knowledge from the biochip ...
An agent-based approach to knowledge management
Bin Yu, Munindar P. Singh
Pages: 642-644
doi>10.1145/584792.584906

Traditional approaches to knowledge management are essentially limited to document management. However, much knowledge in organizations or communities resides in an informal social network and may be accessed only by asking the right people. This paper ...
Features of documents relevant to task- and fact- oriented questions
Diane Kelly, Xiao-jun Yuan, Nicholas J. Belkin, Vanessa Murdock, W. Bruce Croft
Pages: 645-647
doi>10.1145/584792.584907

We describe results from an ongoing project that considers question types and document features and their relationship to retrieval techniques. We examine eight document features from the top 25 documents retrieved from 74 questions and find that lists ...
Data fusion with estimated weights
Shengli Wu, Fabio Crestani
Pages: 648-651
doi>10.1145/584792.584908

This paper proposes an adaptive approach for data fusion of information retrieval systems, which exploits estimated performances of all component input systems without relevance judgement or training. The estimation is conducted prior to the fusion but ...
Discovering the representative of a search engine
King-Lup Liu, Clement Yu, Weiyi Meng
Pages: 652-654
doi>10.1145/584792.584909

Given a large number of search engines on the Internet, it is difficult for a person to determine which search engines could serve his/her information needs. A common solution is to construct a metasearch engine on top of the search engines. Upon receiving ...
Ginga: a self-adaptive query processing system
Henrique Paques, Ling Liu, Calton Pu
Pages: 655-658
doi>10.1145/584792.584910
High-performing feature selection for text classification
Monica Rogati, Yiming Yang
Pages: 659-661
doi>10.1145/584792.584911

This paper reports a controlled study on a large number of filter feature selection methods for text classification. Over 100 variants of five major feature selection criteria were examined using four well-known classification algorithms: a Naive Bayesian ...
Index compression vs. retrieval time of inverted files for XML documents
Norbert Fuhr, Norbert Gövert
Pages: 662-664
doi>10.1145/584792.584912

Query languages for retrieval of XML documents allow for conditions referring both to the content and the structure of documents. In this paper, we investigate two different approaches for reducing index space of inverted files for XML documents. First, ...
Interactive methods for taxonomy editing and validation
Scott Spangler, Jeffrey Kreulen
Pages: 665-668
doi>10.1145/584792.584913

Taxonomies are meaningful hierarchical categorizations of documents into topics reflecting the natural relationships between the documents and their business objectives. Improving the quality of these taxonomies and reducing the overall cost required ...
Knowledge discovery from texts: a concept frame graph approach
Kanagasabai Rajaraman, Ah-Hwee Tan
Pages: 669-671
doi>10.1145/584792.584914

We address the text content mining problem through a concept based framework by constructing a conceptual knowledge base and discovering knowledge therefrom. Defining a novel representation called the Concept Frame Graph (CFG), we propose a learning ...
Knowledge discovery in patent databases
Konstantinos Markellos, Katerina Perdikuri, Penelope Markellou, Spiros Sirmakessis, George Mayritsakis, Athanasios Tsakalidis
Pages: 672-674
doi>10.1145/584792.584915

Nowadays, business, scientific and personal databases are growing at an exponential rate. However, what is truly valuable is the knowledge that can be extracted from the stored data. Knowledge Discovery in patent databases was traditionally based ...
Web-DL: an experience in building digital libraries from the web
Pável P. Calado, Altigran S. da Silva, Berthier Ribeiro-Neto, Alberto H. F. Laender, Juliano P. Lage, Davi C. Reis, Pablo A. Roberto, Monique V. Vieira, Marcos A. Gonçalves, Edward A. Fox
Pages: 675-677
doi>10.1145/584792.584916

The Web contains a huge volume of information, almost all unstructured and, therefore, difficult to manage. In Digital Libraries, however, information is explicitly organized, described, and managed. In this paper, we propose an architecture that allows ...
Mining coverage statistics for websource selection in a mediator
Zaiqing Nie, Ullas Nambiar, Sreelakshmi Vaddi, Subbarao Kambhampati
Pages: 678-680
doi>10.1145/584792.584917

Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition there are no effective approaches for learning the needed statistics. ...
Mining soft-matching association rules
Un Yong Nahm, Raymond J. Mooney
Pages: 681-683
doi>10.1145/584792.584918

Variation and noise in database entries can prevent data mining algorithms, such as association rule mining, from discovering important regularities. In particular, textual fields can exhibit variation due to typographical errors, misspellings, abbreviations, ...
Parallelizing the buckshot algorithm for efficient document clustering
Eric C. Jensen, Steven M. Beitzel, Angelo J. Pilotto, Nazli Goharian, Ophir Frieder
Pages: 684-686
doi>10.1145/584792.584919

We present a parallel implementation of the Buckshot document clustering algorithm. We demonstrate that this parallel approach is highly efficient both in terms of load balancing and minimization of communication. In a series of experiments using the ...
