On scalable information retrieval systems
Ophir Frieder
Pages: 1-1
DOI: 10.1145/584792.584793

Implementing scalable information retrieval systems requires the design and development of efficient methods to ingest data from multiple sources, search and retrieve results from both English and foreign language document collections and from collections comprising multiple data types, harness high-performance computer technology, and accurately answer user questions. Some recent efforts related to the development of scalable information retrieval systems are described. Particular emphasis is placed on those efforts that were adopted into commercial use.

SESSION: Pattern discovery and forecasting

F4: large-scale automated forecasting using fractals
Deepayan Chakrabarti, Christos Faloutsos
Pages: 2-9
DOI: 10.1145/584792.584797

Forecasting has attracted a lot of research interest, with very successful methods for periodic time series. Here, we propose a fast, automated method for non-linear forecasting of both periodic and chaotic time series. We use the technique of delay coordinate embedding, which needs several parameters; our contribution is an automated way of setting these parameters, using the concept of 'intrinsic dimensionality'. Our operational system has fast and scalable algorithms for preprocessing and, using R-trees, also has fast methods for forecasting. The result of this work is a black box which, given a time series as input, finds the best parameter settings and generates a prediction system. Tests on real and synthetic data show that our system achieves low error, while it can handle arbitrarily large datasets.
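A minimal sketch of the delay-coordinate embedding idea the abstract builds on, paired with a plain nearest-neighbor forecaster; the lag, embedding dimension and neighbor count are illustrative hand-picked values, whereas F4's contribution is choosing such parameters automatically (via intrinsic dimensionality) and indexing the embedded vectors with R-trees:

```python
# Delay-coordinate embedding plus nearest-neighbor forecasting, with a
# linear scan instead of an R-tree and hand-picked parameters.
import numpy as np

def delay_embed(series, dim, lag):
    """Build vectors (x_t, x_{t+lag}, ..., x_{t+(dim-1)*lag})."""
    n = len(series) - (dim - 1) * lag
    return np.array([[series[t + j * lag] for j in range(dim)]
                     for t in range(n)])

def forecast_next(series, dim=3, lag=1, k=4):
    """Predict the next value by averaging the successors of the k
    embedded vectors closest to the most recent one."""
    vectors = delay_embed(series, dim, lag)
    query, history = vectors[-1], vectors[:-1]
    dists = np.linalg.norm(history - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # the successor of history vector i is the value one step after its last entry
    successors = [series[i + (dim - 1) * lag + 1] for i in nearest]
    return float(np.mean(successors))

if __name__ == "__main__":
    t = np.arange(500)
    x = np.sin(0.1 * t)                      # a simple periodic series
    print(forecast_next(x, dim=3, lag=5, k=4))
```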

An iterative strategy for pattern discovery in high-dimensional data sets
Chun Tang, Aidong Zhang
Pages: 10-17
DOI: 10.1145/584792.584798

High-dimensional data representation, in which each data item (termed a target object) is described by many features, is a necessary component of many applications. For example, in DNA microarrays, each sample (target object) is represented by thousands of genes as features. Pattern discovery of target objects presents interesting but also very challenging problems. The data sets are typically not task-specific: many features are irrelevant or redundant and should be pruned out or filtered for the purpose of classifying target objects to find empirical patterns. Uncertainty about which features are relevant makes it difficult to construct an informative feature space. This paper proposes an iterative strategy for pattern discovery in high-dimensional data sets. In this approach, the iterative process consists of two interactive components: discovering patterns within target objects and pruning irrelevant features. The performance of the proposed method with various real data sets is also illustrated.

Mining sequential patterns with constraints in large databases
Jian Pei, Jiawei Han, Wei Wang
Pages: 18-25
DOI: 10.1145/584792.584799

Constraints are essential for many sequential pattern mining applications. However, there is no systematic study on constraint-based sequential pattern mining. In this paper, we investigate this issue and point out that the framework developed for constrained frequent-pattern mining does not fit our mission well. An extended framework is developed based on a sequential pattern growth methodology. Our study shows that constraints can be effectively and efficiently pushed deep into sequential pattern mining under this new framework. Moreover, this framework can be extended to constraint-based structured pattern mining as well.

SESSION: Web search 1

Searching web databases by structuring keyword-based queries
Pável Calado, Altigran S. da Silva, Rodrigo C. Vieira, Alberto H. F. Laender, Berthier A. Ribeiro-Neto
Pages: 26-33
DOI: 10.1145/584792.584801

On-line information services have become widespread on the Web. However, Web users are non-specialized and have a great variety of interests. Thus, interfaces for Web databases must be simple and uniform. In this paper we present an approach, based on Bayesian networks, for querying Web databases using keywords only. According to this approach, the user inputs a query through a simple search-box interface. From the input query, one or more plausible structured queries are derived and submitted to Web databases. The results are then retrieved and presented to the user as ranked answers. Our approach reduces the complexity of existing on-line interfaces and offers a solution to the problem of querying several distinct Web databases with a single interface. The applicability of the proposed approach was demonstrated by experimental results with three databases, obtained with a prototype search system that implements it. We found that 77% to 95% of the time, one of the top three resulting structured queries is the proper one. Further, when the user selects one of these three top queries for processing, the ranked answers show average precision figures from 60% to about 100%.

Topic-oriented collaborative crawling
Chiasen Chung, Charles L. A. Clarke
Pages: 34-42
DOI: 10.1145/584792.584802

A major concern in the implementation of a distributed Web crawler is the choice of a strategy for partitioning the Web among the nodes in the system. Our goal in selecting this strategy is to minimize the overlap between the activities of individual nodes. We propose a topic-oriented approach, in which the Web is partitioned into general subject areas with a crawler assigned to each. We examine design alternatives for a topic-oriented distributed crawler, including the creation of a Web page classifier for use in this context. The approach is compared experimentally with a hash-based partitioning, in which crawler assignments are determined by hash functions computed over URLs and page contents. The experimental evaluation demonstrates the feasibility of the approach, addressing issues of communication overhead, duplicate content detection, and page quality assessment.
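For reference, a small sketch of the hash-based partitioning used as the baseline, assigning each URL to a crawler node by hashing its host; the hash function and granularity here are illustrative choices, and the content-hash variant mentioned in the abstract is not shown:

```python
# Assign each URL to one of N crawler nodes by hashing part of the URL.
# Hashing the host keeps a site's pages on one node; hashing the full URL
# spreads them across nodes.
import hashlib
from urllib.parse import urlparse

def assign_crawler(url: str, num_crawlers: int, by_host: bool = True) -> int:
    """Return the index of the crawler responsible for this URL."""
    key = urlparse(url).netloc if by_host else url
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers

urls = ["http://example.org/a.html", "http://example.org/b.html",
        "http://another.example.net/index.html"]
for u in urls:
    print(u, "->", assign_crawler(u, num_crawlers=4))
```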

Meta-recommendation systems: user-controlled integration of diverse recommendations
J. Ben Schafer, Joseph A. Konstan, John Riedl
Pages: 43-51
DOI: 10.1145/584792.584803

In a world where the number of choices can be overwhelming, recommender systems help users find and evaluate items of interest. They do so by connecting users with information regarding the content of recommended items or the opinions of other individuals. Such systems have become powerful tools in domains such as electronic commerce, digital libraries, and knowledge management. In this paper, we address such systems and introduce a new class of recommender system called meta-recommenders. Meta-recommenders provide users with personalized control over the generation of a single recommendation list formed from a combination of rich data using multiple information sources and recommendation techniques. We discuss experiments conducted to aid in the design of interfaces for a meta-recommender in the domain of movies. We demonstrate that meta-recommendations fill a gap in the current design of recommender systems. Finally, we consider the challenges of building real-world, usable meta-recommenders across a variety of domains.

Removing redundancy and inconsistency in memory-based collaborative filtering
Kai Yu, Xiaowei Xu, Anton Schwaighofer, Volker Tresp, Hans-Peter Kriegel
Pages: 52-59
DOI: 10.1145/584792.584804

The application range of memory-based collaborative filtering (CF) is limited due to CF's high memory consumption and long runtime. The approach presented in this paper removes redundant and inconsistent instances (users) from the data, aiming to distinguish informative instances from a large raw user-preference database and thus alleviate the memory and runtime cost of the widely used memory-based CF algorithm. Our work shows that satisfactory accuracy can be achieved by using only a small portion of the original data set, thereby alleviating the storage and runtime cost of the CF algorithm. In our approach, we consider instance selection, in a general sense, as the problem of increasing the a posteriori probability of the optimal model by selecting informative data. We evaluate the empirical performance of our approach on two real-world data sets and attain positive experimental results: data size and prediction time are significantly reduced, while prediction accuracy is almost the same as that achieved by using the complete database.

SESSION: Data warehousing and OLAP

Analysis of pre-computed partition top method for range top-k queries in OLAP data cubes
Zheng Xuan Loh, Tok Wang Ling, Chuan Heng Ang, Sin Yeung Lee
Pages: 60-67
DOI: 10.1145/584792.584806

In decision support systems, having knowledge of the top k values is more informative and crucial than knowing only the maximum value. Unfortunately, the naive method involves high computational cost, and the existing methods for range-max queries are inefficient if applied directly. In this paper, we propose a Pre-computed Partition Top method (PPT) to partition the data cube and pre-store a number of top values for improving query performance. The main focus of this study is to find the optimum values for two parameters, i.e., the partition factor (b) and the number of pre-stored values (r), through an analytical approach. A cost function based on the Poisson distribution is used for the analysis. The analytical results obtained are verified against simulation results. It is shown that the PPT method outperforms alternative methods significantly when proper b and r are used.

Batch data warehouse maintenance in dynamic environments
Bin Liu, Songting Chen, Elke A. Rundensteiner
Pages: 68-75
DOI: 10.1145/584792.584807

Data warehouse view maintenance is an important issue due to the growing use of warehouse technology for information integration and data analysis. Given the dynamic nature of modern distributed environments, both data updates and schema changes are likely to occur in different data sources. In applications where real-time refreshing of the data warehouse extent under source changes is not critical, the source updates are usually maintained in a batch fashion to reduce the maintenance overhead. However, most prior work can only deal with batch source data updates. In this paper, we provide a solution strategy that is capable of batching both source data updates and schema changes. We propose techniques to first preprocess the initial source updates to summarize the delta changes for each source. We then design a view adaptation algorithm to adapt the warehouse view under these delta changes. We have implemented our solutions and incorporated them into an existing data warehouse prototype system. The experimental studies demonstrate the excellent performance achievable by our batch techniques.

A fast filtering scheme for large database cleansing
Sam Y. Sung, Zhao Li, Peng Sun
Pages: 76-83
DOI: 10.1145/584792.584808

Existing data cleansing methods are costly and take a very long time to cleanse large databases. Since large databases are common nowadays, it is necessary to reduce the cleansing time. Data cleansing consists of two main components: a detection method and a comparison method. In this paper, we first propose a simple and fast comparison method, TI-Similarity, which reduces the time for each comparison. Based on TI-Similarity, we propose a new detection method, RAR, to further reduce the number of comparisons. With RAR and TI-Similarity, our new approach for cleansing large databases is composed of two processes: a filtering process and a pruning process. In the filtering process, a fast scan of the database is carried out with RAR and TI-Similarity. This process guarantees the detection of potential duplicate records but may introduce false positives. In the pruning process, the duplicate results from the filtering process are pruned to eliminate the false positives using more trustworthy comparison methods. The performance study shows that our approach is efficient and scalable for cleansing large databases, and is about an order of magnitude faster than existing cleansing methods.

Semantic-based delivery of OLAP summary tables in wireless environments
Mohamed A. Sharaf, Panos K. Chrysanthis
Pages: 84-92
DOI: 10.1145/584792.584809

With the rapid growth in mobile and wireless technologies and the availability, pervasiveness and cost effectiveness of wireless networks, mobile computers are quickly becoming the normal front-end devices for accessing enterprise data. In this paper, we address the issue of efficient delivery of business decision support data, in the form of summary tables, to mobile clients equipped with OLAP front-end tools. Towards this, we propose a new on-demand scheduling algorithm, called SBS, that exploits both the derivation semantics among OLAP summary tables and the mobile clients' ability to execute simple SQL queries. It maximizes the aggregated data sharing between clients and reduces the broadcast length compared to existing techniques. The degree of aggregation can be tuned to control the tradeoff between access time and energy consumption. Further, the proposed scheme adapts well to different request rates, access patterns and data distributions. The algorithm's effectiveness with respect to access time and power consumption is evaluated using simulation.

Future directions in data mining: streams, networks, self-similarity and power laws
Christos Faloutsos
Pages: 93-93
DOI: 10.1145/584792.584794

How can we spot abnormalities in a stream of temperature data from a sensor, or from a network of sensors? What does the Internet look like? Are there 'abnormal' sub-graphs in a given social network, possibly indicating, e.g., money-laundering rings? We present some recent work and list many remaining challenges for these two fascinating issues in data mining, namely streams and networks. Streams appear in numerous settings, in the form of, e.g., temperature readings, road traffic data, series of video frames for surveillance, and patient physiological data. In all these settings, we want to equip the sensors with nimble but powerful enough algorithms to look for patterns and abnormalities (a) on a semi-infinite stream, (b) using finite memory, and (c) without human intervention. For networks, the applications are also numerous: social networks recording who knows/calls/emails whom; the Internet itself, as well as the Web, with routers and links, or pages and hyper-links; the genes and how they are related; customers and the products they buy. In fact, any "many-to-many" database relationship eventually leads to a graph/network. In all these settings we want to find patterns and 'abnormalities' and the most central/important nodes; we also want to predict how the network will evolve; and we want to tackle huge graphs, with millions or billions of nodes and edges. As a promising direction towards these problems, we present some surprising tools from the theory of fractals, self-similarity and power laws. We show how the 'intrinsic' or 'fractal' dimension can help us find patterns when traditional tools and assumptions fail. We show that self-similarity and power-law models work well in an impressive variety of settings, including real, bursty disk and web traffic; skewed distributions of click-streams; and multiple real Internet graphs.

SESSION: Image similarity search systems

Symbolic photograph content-based retrieval
Philippe Mulhem, Joo Hwee Lim
Pages: 94-101
DOI: 10.1145/584792.584811

Photograph retrieval systems face the difficulty of dealing with the different ways of apprehending the content of images. We consider and demonstrate here the use of multiple index representations of photographs to achieve effective retrieval. The use of multiple indexes allows integration of the complementary strengths of different indexing and retrieval models. The proposed representation supports multiple labels for regions and attributes, and handles inferences and relationships. We define links between indexing levels and the related query modes. An experiment conducted on 2400 home photographs shows the behavior of the multiple indexing levels during retrieval.

A compact and efficient image retrieval approach based on border/interior pixel classification
Renato O. Stehling, Mario A. Nascimento, Alexandre X. Falcão
Pages: 102-109
DOI: 10.1145/584792.584812

This paper presents BIC (Border/Interior pixel Classification), a compact and efficient CBIR approach suitable for broad image domains. It has three main components: (1) a simple and powerful image analysis algorithm that classifies image pixels as either border or interior, (2) a new logarithmic distance (dLog) for comparing histograms, and (3) a compact representation for the visual features extracted from images. Experimental results show that the BIC approach is consistently more compact, more efficient and more effective than state-of-the-art CBIR approaches based on sophisticated image analysis algorithms and complex distance functions. It was also observed that the dLog distance function has two main advantages over vectorial distances (e.g., L1): (1) it is able to increase substantially the effectiveness of (several) histogram-based CBIR approaches and, at the same time, (2) it reduces by 50% the space requirement to represent a histogram.
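A rough sketch of the border/interior classification as it is usually described: a pixel counts as interior when its four neighbors share its quantized color, and as border otherwise, yielding two histograms per image. The coarse grayscale quantization below is an illustrative stand-in for the paper's color quantization, and the dLog distance is not reproduced:

```python
# Classify every pixel as border or interior based on its 4-neighborhood in
# quantized color space, and accumulate one histogram per class.
import numpy as np

def bic_histograms(image: np.ndarray, levels: int = 16):
    """image: 2-D array of grayscale values in [0, 256)."""
    q = (image.astype(np.int64) * levels) // 256          # quantized colors
    border = np.zeros(levels, dtype=np.int64)
    interior = np.zeros(levels, dtype=np.int64)
    h, w = q.shape
    for y in range(h):
        for x in range(w):
            c = q[y, x]
            neighbors = [q[ny, nx]
                         for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                         if 0 <= ny < h and 0 <= nx < w]
            if all(n == c for n in neighbors):
                interior[c] += 1
            else:
                border[c] += 1
    return border, interior

img = np.random.randint(0, 256, size=(32, 32))
b, i = bic_histograms(img)
print(b.sum() + i.sum() == img.size)   # every pixel is classified exactly once
```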

Vulnerabilities in similarity search based systems
Ali Saman Tosun, Hakan Ferhatosmanoglu
Pages: 110-117
DOI: 10.1145/584792.584813

Similarity-based queries are common in several modern database applications, such as multimedia, scientific, and biomedical databases. In most of these systems, the database responds with the tuple that is the closest match according to some metric. In this paper we investigate some important security issues related to similarity search in databases. We investigate the vulnerability of such systems against users who try to copy the database by sending automated queries. We analyze two models for similarity search, namely the reply model and the score model. The reply model responds with the best-matching tuple, while the score model responds with only the score of the similarity search. For these models we analyze possible ways of attack and strategies that can be used to detect attacks. Our analysis shows that in the score model it is much easier to plug the vulnerabilities than in the reply model. Sophisticated attacks can easily be used in the reply model, and the database is limited in its capability to prevent such attacks.

SESSION: XML query processing

Efficient evaluation of multiple queries on streaming XML data
Mong Li Lee, Boon Chin Chua, Wynne Hsu, Kian-Lee Tan
Pages: 118-125
DOI: 10.1145/584792.584815

Traditionally, XML documents are processed where they are stored. This allows the query processor to exploit pre-computed data structures (e.g., indexes) to retrieve the desired data efficiently. However, this mode of processing is not suitable for many applications where the documents are frequently updated. In such situations, efficient evaluation of multiple queries over streaming XML documents becomes important. This paper introduces a new operator, mqX-scan, which efficiently evaluates multiple queries with a single pass over streaming XML data. To facilitate matching, mqX-scan utilizes templates containing paths that have been traversed to match regular path expression patterns in a pool of queries. Results of the experiments demonstrate the efficiency and scalability of the mqX-scan operator.

Query processing of streamed XML data
Leonidas Fegaras, David Levine, Sujoe Bose, Vamsi Chaluvadi
Pages: 126-133
DOI: 10.1145/584792.584816

We address the efficient processing of continuous XML streams, in which the server broadcasts XML data to multiple clients concurrently through a multicast data stream, while each client is fully responsible for processing the stream. In our framework, a server may disseminate XML fragments from multiple documents in the same stream, can repeat or replace fragments, and can introduce new fragments or delete invalid ones. A client uses a light-weight database based on our proposed XML algebra to cache stream data and to evaluate XML queries against these data. The synchronization between clients and servers is achieved through annotations and punctuations transmitted along with the data streams. We present a framework for processing XML queries in XQuery form over continuous XML streams. Our framework is based on a novel XML algebra and a new algebraic optimization framework based on query decorrelation, which is essential for non-blocking stream processing.

Multi-level operator combination in XML query processing
Shurug Al-Khalifa, H. V. Jagadish
Pages: 134-141
DOI: 10.1145/584792.584817

A core set of efficient access methods is central to the development of any database system. In the context of an XML database, there has been considerable effort devoted to defining a good set of primitive operators and inventing efficient access methods for each individual operator. These primitive operators have been defined either at the macro-level (using a "pattern tree" to specify a selection, for example) or at the micro-level (using multiple explicit containment joins to instantiate a single XPath expression). In this paper we argue that it is valuable to consider operations at each level. We do this through a study of operator merging: the development of a new access method to implement a combination of two or more primitive operators. It is frequently the case that access methods for merged operators are superior to a pipelined execution of separate access methods for each operator. We show operator merging to be valuable at both the micro-level and the macro-level. Furthermore, we show that the corresponding merged operators are hard to reason with at the other level. Specifically, we consider the influence of projections and set operations on pattern-based selections and containment joins. We show, through both analysis and extensive experimentation, the benefits of considering these operations all together. Even though our experimental verification is only with a native XML database, we have reason to believe that our results apply equally to RDBMS-based XML query engines.

SESSION: XML transactions

XMLTM: efficient transaction management for XML documents
Torsten Grabs, Klemens Böhm, Hans-Jörg Schek
Pages: 142-152
DOI: 10.1145/584792.584819

A common approach to storage and retrieval of XML documents is to store them in a database, together with materialized views on their content. The advantage over "native" XML storage managers seems to be that transactions and concurrency come for free, next to other benefits. But a closer look and preliminary experiments reveal that this results in poor performance for concurrent queries and updates. The reason is that database lock contention hinders parallelism unnecessarily. We therefore investigate concurrency control at the semantic, i.e., XML, level and describe a corresponding transaction manager, XMLTM. It features a new locking protocol, DGLOCK, which generalizes the protocol for locking on directed acyclic graphs by adding simple predicate locking on the content of elements, e.g., on their text. Instead of using the original XML documents, we propose to take advantage of an abstraction of the XML document collection known as DataGuides. XMLTM allows XML processing to run at low ANSI isolation degrees at the underlying database and to release database locks early without sacrificing correctness in this setting. We have built a complete prototype system implemented on top of the XML Extender for IBM DB2. Our evaluation shows that our approach consistently yields performance improvements by an order of magnitude. We stress that our approach can also be implemented within a native XML storage manager, where we expect even better performance.

Efficient synchronization for mobile XML data
Franky Lam, Nicole Lam, Raymond Wong
Pages: 153-160
DOI: 10.1145/584792.584820

Many handheld applications nowadays receive data from a primary database server and operate in an intermittently connected environment. They maintain data consistency with data sources through synchronization. In certain applications, such as sales force automation, it is highly desirable that updates on the data source be reflected at the handheld applications immediately. This paper proposes an efficient method to synchronize XML data on multiple mobile devices. Each device retrieves and caches a local copy of data from the database source based on a regular path expression. These local copies may be overlapping or disjoint with each other. An efficient mechanism is proposed to find all the disjoint copies to avoid unnecessary synchronizations. Each update to the data source is then checked to identify all handheld applications which are affected by the update. Communication costs can be further reduced by eliminating the forwarding of unnecessary operations to groups of mobile clients.

An object-oriented extension of XML for autonomous web applications
Hasan M. Jamil, Giovanni A. Modica
Pages: 161-168
DOI: 10.1145/584792.584821

While the idea of extending XML to include object-oriented features has been gaining popularity in general, the potential of inheritance in document design has not been well recognized in contemporary research. In this paper we demonstrate that XML with dynamic inheritance aids better document design, decreases management overhead, and supports increased autonomy. As an extended application, we point out that dynamic inheritance also helps in the effective automated design of web portals and ontologies. We present an object-oriented extension of the XML language to include dynamic inheritance and describe a middle layer that implements our system. We explain our system with several practical examples.

SESSION: Caching

Efficient prediction of web accesses on a proxy server
Wenwu Lou, Hongjun Lu
Pages: 169-176
DOI: 10.1145/584792.584823

Web access prediction is an active research topic with many applications. Various approaches have been proposed for Web access prediction in the domain of individual Web servers, but they have to be tailored to the domain of proxy servers to satisfy its special requirements in prediction efficiency and scalability. In this paper, the design and implementation of a proxy-based prediction service (PPS) is presented. For prediction efficiency, PPS applies a new prediction scheme which employs a two-layer navigation model to capture both inter-site and intra-site access patterns, incorporated with a bottom-up prediction mechanism that exploits reference locality in proxy logs. For system scalability, PPS manages the navigation model in a disk database and adopts a predictive cache replacement strategy for data shipping between the model database and the cache. We show the superiority of our prediction scheme over existing approaches and validate our model management and caching strategies with a detailed performance study using real-world data.

A self-managing data cache for edge-of-network web applications
Khalil Amiri, Sanghyun Park, Renu Tewari
Pages: 177-185
DOI: 10.1145/584792.584824

Database caching at proxy servers enables dynamic content to be generated at the edge of the network, thereby improving the scalability and response time of web applications. The scale of deployment of edge servers, coupled with the rising costs of their administration, demands that such caching middleware be adaptive and self-managing. To achieve this, a cache must be dynamically populated and pruned based on the application query stream and access pattern. In this paper, we describe such a cache which maintains a large number of materialized views of previous query results. Cached "views" share physical storage to avoid redundancy, and are usually added and evicted dynamically to adapt to the current workload and to available resources. These two properties of large scale (large number of cached views) and overlapping storage introduce several challenges to query matching and storage management which are not addressed by traditional approaches. We describe an edge data cache architecture with a flexible query matching algorithm and a novel storage management policy which work well in such an environment. We perform an evaluation of a prototype of such an architecture using the TPC-W benchmark and find that it reduces query response times by up to 75%, while reducing network and server load.

Cooperative caching by mobile clients in push-based information systems
Takahiro Hara
Pages: 186-193
DOI: 10.1145/584792.584825

Recent advances in computer and wireless communication technologies have increased interest in push-based information systems in which a server repeatedly broadcasts data to clients through a broadband channel. In this paper, assuming an environment where clients in push-based information systems construct ad hoc networks, we propose three caching strategies in which clients cooperatively cache broadcast data items. These strategies shorten the average response time for data access by replacing cached items based on their access frequencies, the network topology, and the time remaining until each item is broadcast next. We also show the results of simulation experiments conducted to evaluate the performance of our proposed strategies.

SESSION: Information extraction and text segmentation

AuGEAS: authoritativeness grading, estimation, and sorting
Ayman Farahat, Geoff Nunberg, Francine Chen
Pages: 194-202
DOI: 10.1145/584792.584827

When searching for content in a large heterogeneous document collection like the World Wide Web, it is not easy to know which documents provide reliable, authoritative information about a subject. The problem is particularly acute for content search serving "high-value" informational needs, such as retrieving medical information, where the cost of error may be high. In this paper, a method is described for estimating the authoritativeness of a document based on textual, non-topical cues. This method is complementary to estimates of authoritativeness based on link structure, such as the PageRank and HITS algorithms. It is particularly suited to "high-value" content search where the user is interested in searching for information about a specific topic. A method for combining textual estimates of authoritativeness with link analysis is also presented. The types of textual cues to authoritativeness that are easily computed and utilized by our method are described, as well as the method used to select a subset of cues to increase computation speed. Methods for applying authoritativeness estimates to re-ranking documents returned from search engines, combining textual authoritativeness with social authority, and use in query expansion are also presented. By combining textual authority with link analysis, a more complete and robust estimate can be made of a document's authoritativeness.

Structural extraction from visual layout of documents
Binyamin Rosenfeld, Ronen Feldman, Yonatan Aumann
Pages: 203-210
DOI: 10.1145/584792.584828

Most information extraction systems focus on the textual content of documents. They treat documents as sequences of words, disregarding the physical and typographical layout of the information. While this strategy helps in focusing the extraction process on the key semantic content of the document, much valuable information can also be derived from the document's physical appearance. Often, fonts, physical positioning and other graphical characteristics are used to provide additional context to the information. This information is lost with pure-text analysis. In this paper we describe a general procedure for structural extraction, which allows for automatic extraction of entities from a document based on their visual characteristics and relative position in the document layout. Our structural extraction procedure is a learning algorithm, which automatically generalizes from examples. The procedure is a general one, applicable to any document format with visual and typographical information. We then describe a specific implementation of the procedure for PDF documents, called PES (PDF Extraction System). PES is able to extract fields such as Author(s), Title, and Date with very high accuracy.

Topic-based document segmentation with probabilistic latent semantic analysis
Thorsten Brants, Francine Chen, Ioannis Tsochantaridis
Pages: 211-218
DOI: 10.1145/584792.584829

This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that bear on different topics. The method combines the use of the Probabilistic Latent Semantic Analysis (PLSA) model with the method of selecting segmentation points based on the similarity values between pairs of adjacent blocks. The use of PLSA allows for a better representation of sparse information in a text block, such as a sentence or a sequence of sentences. Furthermore, segmentation performance is improved by combining different instantiations of the same model, either using different random initializations or different numbers of latent classes. Results on commonly available data sets are significantly better than those of other state-of-the-art systems.
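A hedged sketch of the second ingredient, selecting segmentation points from similarity values between adjacent blocks; it assumes a topic-distribution vector per block (e.g., from an already-trained PLSA model) and uses a simple local-minimum rule, which is a simplification of the paper's selection method:

```python
# Place boundaries where the similarity between adjacent blocks dips below
# its neighbors, given one topic-distribution vector per block.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def boundaries(block_topics):
    """block_topics: list of topic-distribution vectors, one per block.
    Returns indices i such that a boundary is placed between block i and i+1."""
    sims = [cosine(block_topics[i], block_topics[i + 1])
            for i in range(len(block_topics) - 1)]
    cuts = []
    for i in range(1, len(sims) - 1):
        if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]:
            cuts.append(i)            # local dip in adjacent-block similarity
    return cuts

blocks = [np.array(v, dtype=float) for v in
          ([.8, .1, .1], [.7, .2, .1], [.1, .8, .1], [.2, .7, .1])]
print(boundaries(blocks))             # boundary between blocks 1 and 2
```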

SESSION: Sequence similarity search and access methods

How to improve the pruning ability of dynamic metric access methods
Caetano Traina, Jr., Agma Traina, Roberto Santos Filho, Christos Faloutsos
Pages: 219-226
DOI: 10.1145/584792.584831

Complex data retrieval is accelerated using index structures, which organize the data in order to prune comparisons between data items during queries. In metric spaces, comparison operations can be especially expensive, so the pruning ability of indexing methods is especially meaningful. This paper shows how to measure the pruning power of metric access methods, and defines a new measurement, called "prunability," which indicates how well a pruning technique carries out the task of cutting down distance calculations at each tree level. It also presents a new dynamic access method, aiming to minimize the number of distance calculations required to answer similarity queries. We show that this novel structure is up to 3 times faster and requires less than 25% of the distance calculations to answer similarity queries, as compared to existing methods. This gain in performance is achieved by taking advantage of a set of global representatives. Although our technique uses multiple representatives, the index structure still remains dynamic and balanced.
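For context, a sketch of the triangle-inequality pruning that metric access methods rely on: with distances to a representative precomputed, any object whose distance to the representative differs from the query's by more than the search radius can be discarded without a new distance computation. The paper's tree structure and its set of global representatives are not reproduced here:

```python
# Range query with pivot-based pruning: |d(q,rep) - d(rep,o)| <= d(q,o),
# so if that lower bound exceeds the radius, o cannot qualify.
def range_query(objects, dist, rep, rep_dists, query, radius):
    """objects: list of items; dist: metric; rep_dists[i] = dist(rep, objects[i])."""
    d_q_rep = dist(query, rep)            # one distance to the representative
    results = []
    for obj, d_rep_obj in zip(objects, rep_dists):
        if abs(d_q_rep - d_rep_obj) > radius:
            continue                      # pruned: no distance computation needed
        if dist(query, obj) <= radius:    # only surviving candidates are checked
            results.append(obj)
    return results

# toy usage with 1-D points and absolute difference as the metric
objs = [1.0, 2.5, 7.0, 9.5]
rep = objs[0]
rep_d = [abs(rep - o) for o in objs]
print(range_query(objs, lambda a, b: abs(a - b), rep, rep_d, query=8.0, radius=1.0))
```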

On the efficient evaluation of relaxed queries in biological databases
Yangjun Chen, Duren Che, Karl Aberer
Pages: 227-236
DOI: 10.1145/584792.584832

In this paper, a new technique is developed to support query relaxation in biological databases. Query relaxation is required because queries tend not to be expressed exactly by users, especially in scientific databases such as biological databases, in which complex domain knowledge is heavily involved. To treat this problem, we propose the concept of so-called fuzzy equivalence classes to capture important kinds of domain knowledge that are used to relax queries. This concept is further integrated with canonical techniques for pattern searching, such as the position tree and automaton theory. As a result, fuzzy queries produced through relaxation can be efficiently evaluated. This method has been successfully utilized in a practical biological database, the GPCRDB.

Similarity based retrieval from sequence databases using automata as queries
A. Prasad Sistla, Tao Hu, Vikas Chowdhry
Pages: 237-244
DOI: 10.1145/584792.584833

Similarity-based retrieval from sequence databases is important in many applications, such as time-series, video and textual databases. In this paper, automata-based formalisms are introduced for specifying queries over such databases. Various measures defining the distance of a database sequence from an automaton are defined. Efficient methods for similarity-based retrieval are presented for each of the distance measures. These methods answer nearest-neighbor queries (i.e., retrieval of the k closest subsequences) and range queries (i.e., retrieval of all sequences within a given distance).

SESSION: Information retrieval models

Detecting similar documents using salient terms
James W. Cooper, Anni R. Coden, Eric W. Brown
Pages: 245-251
DOI: 10.1145/584792.584835

We describe a system for rapidly determining document similarity among a set of documents obtained from an information retrieval (IR) system. We obtain a ranked list of the most important terms in each document using a rapid phrase recognizer system. We store these in a database and compute document similarity using a simple database query. If the number of terms not contained in both documents, relative to the total number of terms in the document, is less than some predetermined threshold, the documents are determined to be very similar. We compare this to the shingles approach.
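A small sketch of the comparison step only (the phrase-recognition and database parts are not shown): two documents are judged very similar when the proportion of salient terms not shared by both falls below a threshold; the threshold value is illustrative:

```python
# Compare two documents by the fraction of salient terms they do not share.
def very_similar(terms_a, terms_b, threshold=0.2):
    a, b = set(terms_a), set(terms_b)
    total = len(a | b)
    if total == 0:
        return True
    not_shared = len(a ^ b)               # terms appearing in exactly one document
    return not_shared / total < threshold

doc1 = ["information retrieval", "salient terms", "database query", "ranking"]
doc2 = ["information retrieval", "salient terms", "database query", "shingles"]
print(very_similar(doc1, doc2))           # 2 of 5 terms not shared -> False at 0.2
```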

The role of variance in term weighting for probabilistic information retrieval
Warren R. Greiff, William T. Morgan, Jay M. Ponte
Pages: 252-259
DOI: 10.1145/584792.584836

In probabilistic approaches to information retrieval, the occurrence of a query term in a document contributes to the probability that the document will be judged relevant. It is typically assumed that the weight assigned to a query term should be based on the expected value of that contribution. In this paper we show that the degree to which observable document features such as term frequencies are expected to vary is also important. By means of stochastic simulation, we show that increased variance results in degraded retrieval performance. We further show that by decreasing term weights in the presence of variance, this degradation can be reduced. Hence, probabilistic models of information retrieval must take into account not only the expected value of a query term's contribution but also the variance of document features.

Inferring query models by computing information flow
P. D. Bruza, D. Song
Pages: 260-269
DOI: 10.1145/584792.584837

The language modelling approach to information retrieval can also be used to compute query models. A query model can be envisaged as an expansion of an initial query. The more prominent query models in the literature have a probabilistic basis. This paper introduces an alternative, non-probabilistic approach to query modelling whereby the strength of information flow is computed between a query Q and a term w. Information flow is a reflection of how strongly w is informationally contained within the query Q. The information flow model is based on Hyperspace Analogue to Language (HAL) vector representations, which reflect the lexical co-occurrence information of terms. Research from cognitive science has demonstrated the cognitive compatibility of HAL representations with human processing. Query models computed from TREC queries by HAL-based information flow are compared experimentally with two probabilistic query language models. Experimental results show the HAL-based information flow model to be superior to query models computed via Markov chains, and to be about as effective as a probabilistically motivated relevance model.
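A rough sketch of the HAL construction the model rests on: a sliding window in which nearer co-occurrences receive larger weights. The window size is illustrative, the symmetric bookkeeping is a simplification of HAL's row/column matrix, and the information-flow computation itself is not reproduced:

```python
# Build HAL-style co-occurrence vectors: for each word, accumulate weights
# for words seen within a fixed window, weighting nearer words more heavily.
from collections import defaultdict

def hal_vectors(tokens, window=5):
    vectors = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for offset in range(1, window + 1):
            j = i - offset
            if j < 0:
                break
            weight = window - offset + 1       # closer words get larger weights
            vectors[word][tokens[j]] += weight # co-occurrence with preceding word
            vectors[tokens[j]][word] += weight # simplified symmetric bookkeeping
    return vectors

text = "information flow reflects how strongly a term is contained within a query".split()
v = hal_vectors(text, window=3)
print(dict(v["query"]))
```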

SESSION: XML schemas: integration and translation

Logical and physical support for heterogeneous data
Sihem Amer-Yahia, Mary Fernández, Rick Greer, Divesh Srivastava
Pages: 270-281
DOI: 10.1145/584792.584839

Heterogeneity arises naturally in virtually all real-world data. This paper presents evolutionary extensions to a relational database system for supporting three classes of data heterogeneity: variational, structural and annotational heterogeneities. We define these classes and show the impact of these new features on data storage, data-access mechanisms, and the data-description language. Since XML is an important source of heterogeneity, we describe how the system automatically utilizes these new features when storing XML documents.

NeT & CoT: translating relational schemas to XML schemas using semantic constraints
Dongwon Lee, Murali Mani, Frank Chiu, Wesley W. Chu
Pages: 282-291
DOI: 10.1145/584792.584840

Two algorithms, called NeT and CoT, to translate relational schemas to XML schemas using various semantic constraints are presented. The XML schema representation we use is a language-independent formalism named XSchema, which is both precise and concise. A given XSchema can be mapped to a schema in any of the existing XML schema language proposals. Our proposed algorithms have the following characteristics: (1) NeT derives a nested structure from a flat relational model by repeatedly applying the nest operator on each table so that the resulting XML schema becomes hierarchical, and (2) CoT considers not only the structure of relational schemas but also semantic constraints such as inclusion dependencies during the translation. It takes as input a relational schema where multiple tables are interconnected through inclusion dependencies and converts it into a good XSchema. To validate our proposals, we present experimental results using both real schemas from the UCI repository and synthetic schemas from TPC-H.
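A minimal sketch of the nest operator that NeT applies repeatedly: tuples agreeing on every attribute except the nested one collapse into a single tuple whose nested attribute holds the collected set of values; emitting the resulting hierarchy as an XSchema is not shown:

```python
# Relational nest: group rows that agree on all other attributes and collect
# the nested attribute's values into a set.
from collections import defaultdict

def nest(rows, attr):
    """rows: list of dicts with identical keys; attr: attribute to nest."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k != attr))
        groups[key].add(row[attr])
    return [dict(key, **{attr: values}) for key, values in groups.items()]

emp = [{"dept": "db", "proj": "p1", "name": "ann"},
       {"dept": "db", "proj": "p2", "name": "ann"},
       {"dept": "ir", "proj": "p3", "name": "bob"}]
print(nest(emp, "proj"))
# e.g. [{'dept': 'db', 'name': 'ann', 'proj': {'p1', 'p2'}},
#       {'dept': 'ir', 'name': 'bob', 'proj': {'p3'}}]  (set order may vary)
```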

XClust: clustering XML schemas for effective integration
Mong Li Lee, Liang Huai Yang, Wynne Hsu, Xia Yang
Pages: 292-299
DOI: 10.1145/584792.584841

It is increasingly important to develop scalable integration techniques for the growing number of XML data sources. A practical starting point for the integration of large numbers of Document Type Definitions (DTDs) of XML sources would be to first find clusters of DTDs that are similar in structure and semantics. Reconciling similar DTDs within such a cluster will be an easier task than reconciling DTDs that are different in structure and semantics, as the latter would involve more restructuring. We introduce XClust, a novel integration strategy that involves the clustering of DTDs. A matching algorithm based on the semantics, immediate descendants and leaf-context similarity of DTD elements is developed. Our experiments on integrating real-world DTDs demonstrate the effectiveness of the XClust approach.

A local search mechanism for peer-to-peer networks
Vana Kalogeraki, Dimitrios Gunopulos, D. Zeinalipour-Yazti
Pages: 300-307
DOI: 10.1145/584792.584842

One important problem in peer-to-peer (P2P) networks is searching for and retrieving the correct information. However, existing search mechanisms in pure peer-to-peer networks are inefficient due to the decentralized nature of such networks. We propose two mechanisms for information retrieval in pure peer-to-peer networks. The first, the modified Breadth-First Search (BFS) mechanism, is an extension of the current Gnutella protocol; it allows searching with keywords and is designed to minimize the number of messages needed to search the network. The second, the Intelligent Search mechanism, uses the past behavior of the P2P network to further improve the scalability of the search procedure. In this algorithm, each peer autonomously decides which of its peers are most likely to answer a given query. The algorithm is entirely distributed, and therefore scales well with the size of the network. We implemented our mechanisms as middleware platforms. To show the advantages of our mechanisms we present experimental results using the middleware implementation.
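A hedged sketch of the neighbor-ranking idea behind the Intelligent Search mechanism: each peer remembers which keyword queries a neighbor answered before and forwards a new query to the neighbors whose past queries look most similar to it. The Jaccard measure and profile bookkeeping are illustrative assumptions, not the paper's exact model:

```python
# Rank neighboring peers by how similar the new query is to the keyword
# queries each neighbor has answered in the past.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_neighbors(profiles, query, top_k=2):
    """profiles: {neighbor_id: list of previously answered keyword queries}."""
    scores = {}
    for peer, past_queries in profiles.items():
        scores[peer] = max((jaccard(query, q) for q in past_queries), default=0.0)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

profiles = {
    "peer-A": [["jazz", "mp3"], ["blues", "mp3"]],
    "peer-B": [["linux", "iso"]],
    "peer-C": [["jazz", "flac"]],
}
print(rank_neighbors(profiles, ["jazz", "mp3"], top_k=2))   # ['peer-A', 'peer-C']
```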

Intelligent knowledge discovery in peer-to-peer file sharing
Yugyung Lee, Changgyu Oh, Eun Kyo Park
Pages: 308-315
DOI: 10.1145/584792.584843

Emerging peer-to-peer computing provides new possibilities but also challenges for distributed applications. Despite their significant potential, current peer-to-peer networks lack efficient knowledge discovery and management. This paper addresses this deficiency and proposes the Intelligent File Sharing (IFS) framework, which provides effective and flexible querying for P2P file sharing. IFS is based on a powerful schema and flexible inference, as well as efficiently integrated and extensible retrieval algorithms. Experimental results provide evidence of the high performance and scalability of the IFS system in peer-to-peer environments.

Partial rollback in object-oriented/object-relational database management systems
Won-Young Kim, Kyu-Young Whang, Byung Suk Lee, Young-Koo Lee, Ji-Woong Chang
Pages: 316-323
DOI: 10.1145/584792.584844

In a database management system (DBMS), partial rollback is an important mechanism for canceling only part of the operations executed in a transaction back to a savepoint. Partial rollback complicates buffer management because it should restore the state ...
In a database management system (DBMS), partial rollback is an important mechanism for canceling only part of the operations executed in a transaction back to a savepoint. Partial rollback complicates buffer management because it should restore the state of the buffers as well as that of the database. Several relational DBMSs (RDBMSs) currently provide this mechanism using page buffers. However, object-oriented or object-relational DBMSs (OO/ORDBMSs) cannot utilize the partial rollback scheme of RDBMSs as is because, unlike RDBMSs, many of them use a dual buffer consisting of an object buffer and a page buffer. In this paper, we propose a thorough study of partial rollback schemes of OO/ORDBMSs with a dual buffer. First, we classify the partial rollback schemes of OO/ORDBMSs into a single buffer-based scheme and a dual buffer-based scheme by the number of buffers used to process rollback. Next, we propose four alternative partial rollback schemes: a page buffer-based scheme, an object buffer-based scheme, a dual buffer-based scheme using a soft log, and a dual buffer-based scheme using shadows. We then evaluate their performance through simulations. The results show that the dual buffer-based partial rollback scheme using shadows provides the best performance. Partial rollback in OO/ORDBMS has not been addressed in the literature; yet, it is a useful mechanism that must be implemented. The proposed schemes are practical ones that can be implemented in such DBMSs. expand

SESSION: Information retrieval 1

Query association for effective retrieval
Falk Scholer, Hugh E. Williams
Pages: 324-331
doi>10.1145/584792.584846
We introduce a novel technique for document summarisation which we call query association. Query association is based on the notion that a query that is highly similar to a document is a good descriptor of that document. For example, the user query "richmond football club" is likely to be a good summary of the content of a document that is ranked highly in response to the query. We describe this process of defining, maintaining, and presenting the relationship between a user query and the documents that are retrieved in response to that query. We show that associated queries are an excellent technique for describing a document: for relevance judgement, associated queries are as effective as a simple online query-biased summarisation technique. As future work, we suggest additional uses for query association including relevance feedback and query expansion.
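
A minimal sketch of the underlying bookkeeping, assuming a generic search_fn that returns ranked document ids; the cap on associations per document (max_assoc) is an illustrative parameter, not the paper's setting.

from collections import defaultdict

def associate_queries(query_log, search_fn, top_n=10, max_assoc=5):
    """Attach each query to the documents it ranks highly; the accumulated
    queries then act as surrogate summaries of those documents."""
    assoc = defaultdict(list)                    # doc_id -> associated queries
    for query in query_log:
        for doc_id in search_fn(query)[:top_n]:
            if len(assoc[doc_id]) < max_assoc:   # keep only a bounded association set
                assoc[doc_id].append(query)
    return assoc

# Toy search function: rank documents containing more query words first.
docs = {"d1": "richmond football club results", "d2": "club sandwich recipe"}
def search_fn(q):
    words = set(q.split())
    return sorted(docs, key=lambda d: -len(words & set(docs[d].split())))

print(dict(associate_queries(["richmond football club"], search_fn, top_n=1)))
# {'d1': ['richmond football club']}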

Pruning long documents for distributed information retrieval
Jie Lu, Jamie Callan
Pages: 332-339
doi>10.1145/584792.584847
Query-based sampling is a method of discovering the contents of a text database by submitting queries to a search engine and observing the documents returned. In prior research sampled documents were used to build resource descriptions for automatic database selection, and to build a centralized sample database for query expansion and result merging. An unstated assumption was that the associated storage costs were acceptable. When sampled documents are long, storage costs can be large. This paper investigates methods of pruning long documents to reduce storage costs. The experimental results demonstrate that building resource descriptions and centralized sample databases from the pruned contents of sampled documents can reduce storage costs by 54-93% while causing only minor losses in the accuracy of distributed information retrieval.
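
One simple pruning strategy consistent with this idea is to keep only a prefix of each sampled document before it is added to the resource description. The sketch below assumes this first-N-terms heuristic and a term budget of 400; both are illustrative choices rather than the paper's exact method.

def prune_document(text, max_terms=400):
    """Keep only the first `max_terms` word occurrences of a sampled document.
    The pruned text is what would be stored in the resource description or
    centralized sample database instead of the full document."""
    terms = text.split()
    return " ".join(terms[:max_terms])

doc = "word " * 10_000
print(len(prune_document(doc).split()))   # 400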

On Arabic search: improving the retrieval effectiveness via a light stemming approach
Mohammed Aljlayl, Ophir Frieder
Pages: 340-347
doi>10.1145/584792.584848
The inflectional structure of a word impacts the retrieval accuracy of information retrieval systems of Latin-based languages. We present two stemming algorithms for Arabic information retrieval systems. We empirically investigate the effectiveness of surface-based retrieval. This approach degrades retrieval precision since Arabic is a highly inflected language. Accordingly, we propose root-based retrieval. We notice a statistically significant improvement over the surface-based approach. Many variant word senses are based on an identical root; thus, the root-based algorithm creates invalid conflation classes that result in an ambiguous query which degrades the performance by adding extraneous terms. To resolve ambiguity, we propose a novel light-stemming algorithm for Arabic texts. This automatic rule-based stemming algorithm is not as aggressive as the root extraction algorithm. We show that the light stemming algorithm significantly outperforms the root-based algorithm. We also show that a significant improvement in retrieval precision can be achieved with light inflectional analysis of Arabic words.
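
To make the light-stemming idea concrete, here is a small, hedged sketch: strip a handful of frequent Arabic prefixes and suffixes while keeping the stem above a minimum length. The affix lists and the length threshold are illustrative only and are not the rule set evaluated in the paper.

# Illustrative affix lists (not the paper's exact rules).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]

def light_stem(word, min_len=3):
    """Strip at most one common prefix and any number of common suffixes,
    never shrinking the stem below `min_len` characters."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    changed = True
    while changed:
        changed = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) - len(s) >= min_len:
                word = word[:-len(s)]
                changed = True
                break
    return word

print(light_stem("والمكتبات"))   # strips the prefix and plural suffix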

SESSION: Classification

Boosting to correct inductive bias in text classification
Yan Liu, Yiming Yang, Jaime Carbonell
Pages: 348-355
doi>10.1145/584792.584850
This paper studies the effects of boosting in the context of different classification methods for text categorization, including Decision Trees, Naive Bayes, Support Vector Machines (SVMs) and a Rocchio-style classifier. We identify the inductive biases of each classifier and explore how boosting, as an error-driven resampling mechanism, reacts to those biases. Our experiments on the Reuters-21578 benchmark show that boosting is not effective in improving the performance of the base classifiers on common categories. However, the effect of boosting for rare categories varies across classifiers: for SVMs and Decision Trees, we achieved a 13-17% performance improvement in macro-averaged F1 measure, but did not obtain substantial improvement for the other two classifiers. This interesting finding of boosting on rare categories has not been reported before.

Using conjunction of attribute values for classification
Mukund Deshpande, George Karypis
Pages: 356-364
doi>10.1145/584792.584851
Advances in the efficient discovery of frequent itemsets have led to the development of a number of schemes that use frequent itemsets to aid developing accurate and efficient classifiers. These approaches use the frequent itemsets to generate a set of composite features that expand the dimensionality of the underlying dataset. In this paper, we build upon this work and (i) present a variety of schemes for composite feature selection that achieve a substantial reduction in the number of features without adversely affecting the accuracy gains, and (ii) show (both analytically and experimentally) that the composite features can lead to improved classification models even in the context of support vector machines, in which the dimensionality can automatically be expanded by the use of appropriate kernel functions.

Categorizing information objects from user access patterns
Mao Chen, Andrea LaPaugh, Jaswinder Pal Singh
Pages: 365-372
doi>10.1145/584792.584852
Many web sites have dynamic information objects whose topics change over time. Classifying these objects automatically and promptly is a challenging and important problem for site masters. Traditional content-based and link structure based classification techniques have intrinsic limitations for this task. This paper proposes a framework to classify an object into an existing category structure by analyzing the users' traversals in the category structure. The key idea is to infer an object's topic from the predicted preferences of users when they access the object. We compare two approaches using this idea. One analyzes collective user behavior and the other each user's accesses. We present experimental results on actual data that demonstrate a much higher prediction accuracy and applicability with the latter approach. We also analyze the correlation between classification quality and various factors such as the number of users accessing the object. To our knowledge, this work is the first effort in combining object classification with user access prediction.

Knowledge and information management: is it possible to do interesting and important research, get funded, be useful and appreciated?
Maria Zemankova
Pages: 373-374
doi>10.1145/584792.584795
The survey of the CIKM Call for Papers for the period 1998-2002 demonstrates that the CIKM organizers very accurately "identify challenging problems facing the development of future knowledge and information systems [in] applied and theoretical research" [1998] and also play an important role in fostering "bridging traditionally separated areas such as databases and information retrieval, or those that apply techniques from one area to another" [2001, 2002]. The presented CIKM papers also indicate that researchers work on interesting problems. This talk will discuss some additional research topics for future consideration by the CIKM community. In most cases, to achieve important results, research needs to be well supported. If you are not getting the funding you need, this talk may provide some pointers where you can look for funding. If you are well supported, I will try to convince you that you can be instrumental in improving the funding scenario for everybody, by mentoring the junior members of the CIKM community, by forming collaborative (international, interdisciplinary) teams and by letting the funders know what you find conducive to your research and what you consider a hindrance. Regardless of whether you are well funded or not, it is most helpful if you are active in identifying new research directions and also assist in evaluating the priorities. The most frequent reason for inadequate funding is lack of funds. However, researchers can help! This can be achieved in many different ways: thinking about long-term applications of fundamental research to societal needs; working with communities that can directly benefit from research; sharing the research results not only with the research colleagues, but also wider constituencies - at their appropriate levels; and informing your funders about your spectacular achievements, i.e., providing good reasons for increasing the research funding. CIKM strives to bring together research communities that traditionally do not work together. Providing a forum for interdisciplinary research is laudable and very important, as very often interdisciplinary or international research is not "appreciated". This talk will discuss how we could gradually change the discipline- and country-based "appraisal" cultures. We will also attempt to answer the question: "What is the ultimate appreciation?" (Nobel Prize? ...successful .com?... ???)

SESSION: Language models for information retrieval

Passage retrieval based on language models
Xiaoyong Liu, W. Bruce Croft
Pages: 375-382
doi>10.1145/584792.584854
Previous research has shown that passage-level evidence can bring added benefits to document retrieval when documents are long or span different subject areas. Recent developments in the language modeling approach to IR have provided an effective new alternative to traditional retrieval models. These two streams of research motivate us to examine the use of passages in a language model framework. This paper reports on experiments using passages in a simple language model and a relevance model, and compares the results with document-based retrieval. Results from the INQUERY search engine, which is not based on a language modeling approach, are also given for comparison. Test data include two heterogeneous and one homogeneous document collections. Our experiments show that passage retrieval is feasible in the language modeling context, and more importantly, it can provide more reliable performance than retrieval based on full documents.
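
A minimal sketch of passage-based query-likelihood scoring: split a document into overlapping fixed-size word windows, score each window with a Jelinek-Mercer-smoothed unigram model, and rank the document by its best passage. Window size, overlap, the smoothing weight and the max-passage document score are assumptions for illustration, not the paper's configuration.

import math
from collections import Counter

def passages(doc_tokens, size=200, overlap=100):
    """Fixed-size overlapping word windows over a tokenized document."""
    step = size - overlap
    for start in range(0, max(1, len(doc_tokens) - overlap), step):
        yield doc_tokens[start:start + size]

def passage_loglik(query_tokens, psg, coll_tf, coll_len, lam=0.5):
    """Query log-likelihood with Jelinek-Mercer smoothing against the collection model."""
    tf = Counter(psg)
    score = 0.0
    for t in query_tokens:
        p_psg = tf[t] / len(psg) if psg else 0.0
        p_coll = coll_tf.get(t, 0) / coll_len
        score += math.log(lam * p_psg + (1 - lam) * p_coll + 1e-12)
    return score

def score_document(query_tokens, doc_tokens, coll_tf, coll_len):
    """Rank a document by its best-scoring passage (one common strategy)."""
    return max(passage_loglik(query_tokens, p, coll_tf, coll_len)
               for p in passages(doc_tokens))

coll = "the cat sat on the mat the dog sat".split()
doc = "the cat chased the dog around the mat".split()
print(round(score_document(["cat", "dog"], doc, Counter(coll), len(coll)), 3))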

Capturing term dependencies using a language model based on sentence trees
Ramesh Nallapati, James Allan
Pages: 383-390
doi>10.1145/584792.584855
We describe a new probabilistic Sentence Tree Language Modeling approach that captures term dependency patterns in Topic Detection and Tracking's (TDT) Story Link Detection task. New features of the approach include modeling the syntactic structure of sentences in documents by a sentence-bin approach and a computationally efficient algorithm for capturing the most significant sentence-level term dependencies using a Maximum Spanning Tree approach, similar to Van Rijsbergen's modeling of document-level term dependencies. The new model is a good discriminator of on-topic and off-topic story pairs, providing evidence that sentence-level term dependencies contain significant information about relevance. Although runs on a subset of the TDT2 corpus show that the model is outperformed by the unigram language model, a mixture of the unigram and the Sentence Tree models is shown to improve on the best performance, especially in the regions of low false alarms.

A language modeling framework for resource selection and results merging
Luo Si, Rong Jin, Jamie Callan, Paul Ogilvie
Pages: 391-397
doi>10.1145/584792.584856
Statistical language models have been proposed recently for several information retrieval tasks, including the resource selection task in distributed information retrieval. This paper extends the language modeling approach to integrate resource selection, ad-hoc searching, and merging of results from different text databases into a single probabilistic retrieval model. This new approach is designed primarily for Intranet environments, where it is reasonable to assume that resource providers are relatively homogeneous and can adopt the same kind of search engine. Experiments demonstrate that this new, integrated approach is at least as effective as the prior state-of-the-art in distributed IR.

SESSION: Spatial search and moving objects

An efficient and effective algorithm for density biased sampling
Alexandros Nanopoulos, Yannis Manolopoulos, Yannis Theodoridis
Pages: 398-404
doi>10.1145/584792.584858
In this paper we describe a new density-biased sampling algorithm. It exploits spatial indexes and the local density information they preserve, to provide improved quality of sampling result and fast access to elements of the dataset. It attains improved sampling quality, with respect to factors like skew, noise or dimensionality. Moreover, it has the advantage of efficiently handling dynamic updates, and it requires low execution times. The performance of the proposed method is examined experimentally. The comparative results illustrate its superiority over existing methods.
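
The core idea - sample sparse regions more aggressively than dense ones - can be sketched as below. The paper obtains local density from a spatial index; this illustration substitutes a uniform grid, and the cell size and bias exponent are assumed parameters.

import random
from collections import defaultdict

def density_biased_sample(points, sample_size, cell=1.0, e=0.5, seed=0):
    """Grid-based density-biased sampling of 2-D points: points in dense cells
    get lower inclusion probability than points in sparse cells (exponent e
    controls the bias); sample_size is the expected, not exact, output size."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    keys = []
    for x, y in points:
        k = (int(x // cell), int(y // cell))
        counts[k] += 1
        keys.append(k)
    weights = [1.0 / counts[k] ** e for k in keys]
    total = sum(weights)
    return [p for p, w in zip(points, weights)
            if rng.random() < sample_size * w / total]

pts = [(0.1 * i, 0.1 * i) for i in range(100)] + [(50 + i, 50 + i) for i in range(5)]
print(len(density_biased_sample(pts, sample_size=10)))   # roughly 10, biased toward the sparse tail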

"GeoPlot": spatial data mining on video libraries
Jia-Yu Pan, Christos Faloutsos
Pages: 405-412
doi>10.1145/584792.584859
Are "tornado" touchdowns related to "earthquakes"? How about to "floods", or to "hurricanes"? In Informedia [14], using a gazetteer on news video clips, we map news onto points on the globe and find correlations between sets of points. In this paper we show how to find answers to such questions, and how to look for patterns on the geo-spatial relationships of news events. The proposed tool is "GeoPlot", which is fast to compute and gives a lot of useful information which traditional text retrieval cannot find. We describe our experiments on two years' worth of video data (~20 Gbytes). There we found that GeoPlot can find unexpected correlations that text retrieval would never find, such as those between "earthquake" and "volcano", and "tourism" and "wine". In addition, GeoPlot provides a good visualization of a data set's characteristics. Characteristics at all scales are shown in one plot and a wealth of information is given, for example, geo-spatial clusters, characteristic scales, and intrinsic (fractal) dimensions of the events' locations.

Trajectory queries and octagons in moving object databases
Hongjun Zhu, Jianwen Su, Oscar H. Ibarra
Pages: 413-421
doi>10.1145/584792.584860
An important class of queries in moving object databases involves trajectories. We propose to divide trajectory predicates into topological and non-topological parts, and extend the 9-intersection model of Egenhofer-Franzosa to a 3-step evaluation strategy for trajectory queries: a filter step, a refinement step, and a tracing step. The filter and refinement steps are similar to region searches. As in spatial databases, approximations of trajectories are typically used in evaluating trajectory queries. In earlier studies, minimum bounding boxes (mbrs) are used to approximate trajectory segments, which allows index structures to be built, e.g., TB-trees and R*-trees. The use of mbrs hinders efficiency since mbrs are very coarse approximations, especially for trajectory segments. To overcome this problem, we propose a new type of approximation, the "minimum bounding octagon prism" (mbop). We extend the R*-tree to a new index structure, the "Octagon-Prism tree" (OP-tree), for mbops of trajectory segments. We conducted experiments to evaluate the efficiency of OP-trees in performing region searches and trajectory queries. The results show that OP-trees improve region searches significantly over TB-trees and R*-trees on synthetic trajectory data sets and can significantly reduce the evaluation cost of trajectory queries compared to TB-trees.
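
A minimum bounding octagon for a 2-D trajectory segment can be captured by the extremes along four directions (x, y, x+y, x-y); disjointness on any direction lets the filter step prune a candidate pair. This is a simplified 2-D illustration of the octagon idea and ignores the time dimension of the full octagon prism.

def mbop(points):
    """Octagon approximation of a 2-D segment: min/max of x, y, x+y and x-y
    (four axis directions -> eight bounding half-planes)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    us = [p[0] + p[1] for p in points]
    vs = [p[0] - p[1] for p in points]
    return (min(xs), max(xs), min(ys), max(ys),
            min(us), max(us), min(vs), max(vs))

def may_overlap(a, b):
    """Filter-step test: if the octagons are disjoint along any of the four
    directions, the underlying segments cannot intersect."""
    for i in range(0, 8, 2):
        if a[i + 1] < b[i] or b[i + 1] < a[i]:
            return False
    return True

seg1 = [(0, 0), (1, 2), (2, 3)]
seg2 = [(5, 5), (6, 7)]
print(may_overlap(mbop(seg1), mbop(seg2)))   # False -> pruned in the filter step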

SESSION: Music information retrieval

The effectiveness study of various music information retrieval approaches
Jia-Lien Hsu, Arbee L. P. Chen, Hung-Chen Chen, Ning-Han Liu
Pages: 422-429
doi>10.1145/584792.584862
In this paper, we describe the Ultima project which aims to construct a platform for evaluating various approaches of music information retrieval. Two kinds of approaches are adopted in this project. These approaches differ in various aspects, such as representations of music objects, index structures, and approximate query processing strategies. For a fair comparison, we propose a measurement of the retrieval effectiveness by recall-precision curves with a scaling factor adjustment. Finally, the performance study of the retrieval effectiveness based on various factors of these approaches is presented.

Harmonic models for polyphonic music retrieval
Jeremy Pickens, Tim Crawford
Pages: 430-437
doi>10.1145/584792.584863
Most work in the ad hoc music retrieval field has focused on the retrieval of monophonic documents using monophonic queries. Polyphony adds considerably more complexity. We present a method by which polyphonic music documents may be retrieved by polyphonic music queries. A new harmonic description technique is given, wherein the information from all chords, rather than the most significant chord, is used. This description is then combined in a new and unique way with Markov statistical methods to create models of both documents and queries. Document models are compared to query models and then ranked by score. Though test collections for music are currently scarce, we give the first known recall-precision graphs for polyphonic music retrieval, and results are favorable.

A singer identification technique for content-based classification of MP3 music objects
Chih-Chin Liu, Chuan-Sung Huang
Pages: 438-445
doi>10.1145/584792.584864
As there is a growing amount of MP3 music data available on the Internet today, the problems related to music classification and content-based music retrieval are getting more attention recently. In this paper, we propose an approach to automatically classify MP3 music objects according to their singers. First, the coefficients extracted from the output of the polyphase filters are used to compute the MP3 features for segmentation. Based on these features, an MP3 music object can be decomposed into a sequence of notes (or phonemes). Then for each MP3 phoneme in the training set, its MP3 feature is extracted and used to train an MP3 classifier which can identify the singer of an unknown MP3 music object. Experiments are performed and analyzed to show the effectiveness of the proposed method.

SESSION: XML constraints and the semantic web

XKvalidator: a constraint validator for XML
Yi Chen, Susan B. Davidson, Yifeng Zheng
Pages: 446-452
doi>10.1145/584792.584866
The role of XML in data exchange is evolving from one of merely conveying the structure of data to one that also conveys its semantics. In particular, several proposals for key and foreign key constraints have recently appeared, and aspects of these proposals have been adopted within XML Schema. In this paper, we examine the problem of checking keys and foreign keys in XML documents using a validator based on SAX. The algorithm relies on an indexing technique based on the paths found in key definitions, and can be used for checking the correctness of an entire document (bulk checking) as well as for checking updates as they are made to the document (incremental checking). The asymptotic performance of the algorithm is linear in the size of the document or update. Furthermore, experimental results demonstrate reasonable performance.
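
A toy version of bulk key checking, assuming a simple absolute key given as a target path plus key-field paths; it uses ElementTree rather than the paper's SAX-based incremental validator, so it only illustrates the uniqueness test itself.

import xml.etree.ElementTree as ET

def check_key(xml_text, target_path, key_fields):
    """Bulk key check: every node matched by `target_path` (relative to the root)
    must have a unique combination of values for `key_fields` (relative paths)."""
    root = ET.fromstring(xml_text)
    seen = {}
    for node in root.findall(target_path):
        key = tuple(node.findtext(f) for f in key_fields)
        if None in key:
            return False, f"missing key field in {key}"
        if key in seen:
            return False, f"duplicate key {key}"
        seen[key] = node
    return True, "ok"

doc = """<lib>
  <book><isbn>111</isbn><title>A</title></book>
  <book><isbn>111</isbn><title>B</title></book>
</lib>"""
print(check_key(doc, "book", ["isbn"]))   # (False, "duplicate key ('111',)")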

Discovering approximate keys in XML data
Gösta Grahne, Jianfei Zhu
Pages: 453-460
doi>10.1145/584792.584867
Keys are very important in many aspects of data management, such as guiding query formulation, query optimization, indexing, etc. We consider the situation where an XML document does not come with key definitions, and we are interested in using data mining techniques to obtain a representation of the keys holding in a document. In order to have a compact representation of the set of keys holding in a document, we define a partial order on the set of all key expressions. This order is based on an analysis of the properties of absolute and relative keys for XML. Given the existence of the partial order, only a reduced set of key expressions needs to be discovered. Due to the semistructured nature of XML documents, it turns out to be useful to consider keys that hold in "almost" the whole document, that is, they are violated only in a small part of the document. To this end, the support and confidence of a key expression are also defined, and the concept of approximate key expression is introduced. We give an efficient algorithm to mine a reduced set of approximate keys from an XML document.

Information retrieval on the semantic web
Urvi Shah, Tim Finin, Anupam Joshi, R. Scott Cost, James Matfield
Pages: 461-468
doi>10.1145/584792.584868
We describe an approach to retrieval of documents that contain both free text and semantically enriched markup. In particular, we present the design and prototype implementation of a framework in which both documents and queries can be marked up with statements in the DAML+OIL semantic web language. These statements provide both structured and semi-structured information about the documents and their content. We claim that indexing text and semantic markup together will significantly improve retrieval performance. Our approach allows inferencing to be done over this information at several points: when a document is indexed, when a query is processed and when query results are evaluated.

SESSION: Data streams and time-series

RHist: adaptive summarization over continuous data streams
Lin Qiao, Divyakant Agrawal, Amr El Abbadi
Pages: 469-476
doi>10.1145/584792.584870
Maintaining approximate aggregates and summaries over data streams is crucial to handle the OLAP query workload that arises in applications, such as network monitoring and telecommunications. Furthermore, since the entire data set is not available at all times, the maintenance task must be done incrementally. We show that R(elaxed)Hist(ogram) is an appropriate summarization technique for the data stream scenario. In order to reduce query estimation errors, we propose adaptive approaches which not only capture the data distribution, but also integrate independent query patterns. We introduce a workload decay model to efficiently capture global workload information and ensure that the query patterns from the recent past are weighted more than queries that are further in the past. We verify experimentally that our approach successfully adapts to continuously changing workload as well as data streams.

Efficient query monitoring using adaptive multiple key hashing
Kun-Lung Wu, Philip S. Yu
Pages: 477-484
doi>10.1145/584792.584871
Monitoring continual queries or subscriptions means determining the subset of all queries or subscriptions whose predicates match a given event. Predicates contain not only equality but also non-equality clauses. Event matching is usually accomplished by first identifying a "small" candidate set of subscriptions for an event and then determining the matched subscriptions from the candidate set. Prior work has focused on using equality clauses to identify the candidate set. However, we found that completely ignoring non-equality clauses can result in a much larger candidate set. In this paper, we present and evaluate an adaptive multiple key hashing (AMKH) method to judiciously include an effective subset of non-equality clauses in candidate set identification. Each subscription is mapped to a data point in a multidimensional space based on its predicate clauses. AMKH is then used to maintain subscriptions and perform event matching. AMKH further provides a controlling mechanism to limit the hash range of a non-equality clause, hence reducing the size of the candidate set. Simulations are conducted to study the performance of AMKH. The results show that (1) a small number of non-equality clauses can be effectively included by AMKH and (2) the attributes whose overall non-equality predicate clauses are most selective should be chosen for inclusion by AMKH.

Evaluating continuous nearest neighbor queries for streaming time series via pre-fetching
Like Gao, Zhengrong Yao, X. Sean Wang
Pages: 485-492
doi>10.1145/584792.584872
For many applications, it is important to quickly locate the nearest neighbor of a given time series. When the given time series is a streaming one, nearest neighbors may need to be found continuously at all time positions. Such a standing request is called a continuous nearest neighbor query. This paper seeks fast evaluation of continuous queries on large databases. The initial strategy is to use the result of one evaluation to restrict the search space for the next. A more fundamental idea is to extend the existing indexing methods, used in many traditional nearest neighbor algorithms, with pre-fetching. Specifically, pre-fetching is to predict the next value of the stream before it arrives, and to process the query as if the predicted value were the real one in order to load the needed index pages and time series into the allocated cache memory. Furthermore, if the pre-fetched candidates cannot fit into the cache memory, they are stored in a sequential file to facilitate fast access to them. Experiments show that pre-fetching improves the response time greatly over the direct use of traditional algorithms, even if the caching provided by the operating system is taken into consideration.

Mining temporal classes from time series data
Masahiro Motoyoshi, Takao Miura, Kohei Watanabe
Pages: 493-498
doi>10.1145/584792.584873
In this investigation, we discuss how to mine Temporal Class Schemes to model a collection of time series data. From the viewpoint of temporal data mining, this problem can be seen as discretizing time series data or aggregating them. It can also be considered as screening (or noise filtering). From the viewpoint of temporal databases, the issue is how we represent the data and how we can obtain intensional aspects as temporal schemes. In other words, we discuss scheme discovery for temporal data. Given a collection of temporal objects along a time axis (called a log), we examine the data and introduce a notion of temporal frequent classes to describe them. As the main results of this investigation, we show that there exists one and only one interval decomposition and the temporal classes related to them. We also give experimental results that demonstrate the feasibility of the approach on time series data.

SESSION: Web clustering

Evaluating contents-link coupled web page clustering for web search results
Yitong Wang, Masaru Kitsuregawa
Pages: 499-506
doi>10.1145/584792.584875
Clustering is currently one of the most crucial techniques for dealing with the massive amount of heterogeneous information on the web (e.g., locating resources, interpreting information). Unlike clustering in other fields, web page clustering separates unrelated pages and clusters related pages (to a specific topic) into semantically meaningful groups, which is useful for discrimination, summarization, organization and navigation of unstructured web pages. We have proposed a contents-link coupled clustering algorithm that clusters web pages by combining contents and link analysis. In this paper, we particularly study the effects of out-links (from the web pages), in-links (to the web page) and terms on the final clustering results as well as how to effectively combine these three parts to improve the quality of clustering results. We apply it to cluster web search results. Preliminary experiments and evaluations are conducted on various topics. As the experimental results show, the proposed clustering algorithm is effective and promising.

Inferring hierarchical descriptions
Eric Glover, David M. Pennock, Steve Lawrence, Robert Krovetz
Pages: 507-514
doi>10.1145/584792.584876
We create a statistical model for inferring hierarchical term relationships about a topic, given only a small set of example web pages on the topic, without prior knowledge of any hierarchical information. The model can utilize either the full text of the pages in the cluster or the context of links to the pages. To support the model, we use "ground truth" data taken from the category labels in the Open Directory. We show that the model accurately separates terms in the following classes: self terms describing the cluster, parent terms describing more general concepts, and child terms describing specializations of the cluster. For example, for a set of biology pages, sample parent, self, and child terms are science, biology, and genetics respectively. We create an algorithm to predict parent, self, and child terms using the new model, and compare the predictions to the ground truth data. The algorithm accurately ranks a majority of the ground truth terms highly, and identifies additional complementary terms missing in the Open Directory.

Evaluation of hierarchical clustering algorithms for document datasets
Ying Zhao, George Karypis
Pages: 515-524
doi>10.1145/584792.584877
Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections. In this paper we evaluate different partitional and agglomerative approaches for hierarchical clustering. Our experimental evaluation showed that partitional algorithms always lead to better clustering solutions than agglomerative algorithms, which suggests that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance. We present a new class of clustering algorithms called constrained agglomerative algorithms that combine the features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.

Strategies for minimising errors in hierarchical web categorisation
Wahyu Wibowo, Hugh E. Williams
Pages: 525-531
doi>10.1145/584792.584878
On the Web, browsing and searching categories is a popular method of finding documents. Two well-known category-based search systems are the Yahoo! and DMOZ hierarchies, which are maintained by experts who assign documents to categories. However, manual categorisation by experts is costly, subjective, and not scalable with the increasing volumes of data that must be processed. Several methods have been investigated for effective automatic text categorisation. These include selection of categorisation methods, selection of pre-categorised training samples, use of hierarchies, and selection of document fragments or features. In this paper, we further investigate categorisation into Web hierarchies and the role of hierarchical information in improving categorisation effectiveness. We introduce new strategies to reduce errors in hierarchical categorisation. In particular, we propose novel techniques that shift the assignment into higher level categories when lower level assignment is uncertain. Our results show that absolute error rates can be reduced by over 2%.
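
The shift-up strategy can be sketched independently of any particular classifier: if the top-scoring leaf category is not clearly ahead of the runner-up, assign the document to that category's parent instead. The margin threshold and the score format below are assumptions for illustration, not the strategies evaluated in the paper.

def assign_category(scores, parent, margin=0.2):
    """scores: category -> classifier confidence; parent: category -> parent category.
    If the best leaf is not clearly ahead of the runner-up, shift the assignment
    one level up, trading specificity for a lower chance of error."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else None
    confident = second is None or scores[best] - scores[second] >= margin
    return best if confident else parent.get(best, best)

parent = {"soccer": "sport", "tennis": "sport"}
print(assign_category({"soccer": 0.51, "tennis": 0.49}, parent))  # 'sport' (uncertain leaf)
print(assign_category({"soccer": 0.90, "tennis": 0.10}, parent))  # 'soccer'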

SESSION: Information retrieval

Knowledge-based extraction of named entities
Jamie Callan, Teruko Mitamura
Pages: 532-537
doi>10.1145/584792.584880
The usual approach to named-entity detection is to learn extraction rules that rely on linguistic, syntactic, or document format patterns that are consistent across a set of documents. However, when there is no consistency among documents, it may be more effective to learn document-specific extraction rules. This paper presents a knowledge-based approach to learning rules for named-entity extraction. Document-specific extraction rules are created using a generate-and-test paradigm and a database of known named-entities. Experimental results show that this approach is effective on Web documents that are difficult for the usual methods.

Condorcet fusion for improved retrieval
Mark Montague, Javed A. Aslam
Pages: 538-548
doi>10.1145/584792.584881
We present a new algorithm for improving retrieval results by combining document ranking functions: Condorcet-fuse. Beginning with one of the two major classes of voting procedures from Social Choice Theory, the Condorcet procedure, we apply a graph-theoretic analysis that yields a sorting-based algorithm that is elegant, efficient, and effective. The algorithm performs very well on TREC data, often outperforming existing metasearch algorithms whether or not relevance scores and training data are available. Condorcet-fuse significantly outperforms Borda-fuse, the analogous representative from the other major class of voting algorithms.
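
A compact sketch of the Condorcet-fuse idea: sort all retrieved documents with a comparator that asks which of two documents is ranked higher by a majority of the input systems (documents missing from a ranking are treated as ranked last). Tie and cycle handling here is simply whatever the sort produces, which glosses over details the paper treats carefully.

from functools import cmp_to_key

def condorcet_fuse(rankings):
    """rankings: list of ranked lists of doc ids (best first).
    Sort all documents with a pairwise-majority comparator."""
    positions = []
    docs = set()
    for r in rankings:
        positions.append({d: i for i, d in enumerate(r)})
        docs.update(r)

    def cmp(a, b):
        a_wins = b_wins = 0
        for pos in positions:
            pa = pos.get(a, len(pos))   # unranked docs count as ranked last
            pb = pos.get(b, len(pos))
            if pa < pb:
                a_wins += 1
            elif pb < pa:
                b_wins += 1
        return -1 if a_wins > b_wins else (1 if b_wins > a_wins else 0)

    return sorted(docs, key=cmp_to_key(cmp))

runs = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d1", "d3", "d2"]]
print(condorcet_fuse(runs))   # ['d1', 'd2', 'd3']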

I/O-efficient techniques for computing pagerank
Yen-Yu Chen, Qingqing Gan, Torsten Suel
Pages: 549-557
doi>10.1145/584792.584882
Over the last few years, most major search engines have integrated link-based ranking techniques in order to provide more accurate search results. One widely known approach is the Pagerank technique, which forms the basis of the Google ranking scheme, and which assigns a global importance measure to each page based on the importance of other pages pointing to it. The main advantage of the Pagerank measure is that it is independent of the query posed by a user; this means that it can be precomputed and then used to optimize the layout of the inverted index structure accordingly. However, computing the Pagerank measure requires implementing an iterative process on a massive graph corresponding to billions of web pages and hyperlinks. In this paper, we study I/O-efficient techniques to perform this iterative computation. We derive two algorithms for Pagerank based on techniques proposed for out-of-core graph algorithms, and compare them to two existing algorithms proposed by Haveliwala. We also consider the implementation of a recently proposed topic-sensitive version of Pagerank. Our experimental results show that for very large data sets, significant improvements over previous results can be achieved on machines with moderate amounts of memory. On the other hand, at most minor improvements are possible on data sets that are only moderately larger than memory, which is the case in many practical scenarios.
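
For reference, the iterative computation itself is ordinary power iteration; the paper's contribution is organizing it so the link graph can be streamed from disk in blocks. The in-memory sketch below assumes every link target also appears as a key of the graph dict and redistributes dangling-page mass uniformly.

def pagerank(out_links, d=0.85, iters=50):
    """Plain power iteration over an in-memory adjacency dict (node -> list of targets)."""
    nodes = list(out_links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - d) / n for u in nodes}
        dangling = sum(pr[u] for u in nodes if not out_links[u])
        for u in nodes:
            targets = out_links[u]
            share = d * pr[u] / len(targets) if targets else 0.0
            for v in targets:              # assumes every target is a key of out_links
                nxt[v] += share
        for u in nodes:                    # spread dangling mass uniformly
            nxt[u] += d * dangling / n
        pr = nxt
    return pr

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))                     # scores sum to 1.0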

SESSION: Web search 2

Personalized web search by mapping user queries to categories
Fang Liu, Clement Yu, Weiyi Meng
Pages: 558-565
doi>10.1145/584792.584884
Current web search engines are built to serve all users, independent of the needs of any individual user. Personalization of web search is to carry out retrieval for each user incorporating his/her interests. We propose a novel technique to map a user query to a set of categories, which represent the user's search intention. This set of categories can serve as a context to disambiguate the words in the user's query. A user profile and a general profile are learned from the user's search history and a category hierarchy respectively. These two profiles are combined to map a user query into a set of categories. Several learning and combining algorithms are evaluated and found to be effective. Among the algorithms to learn a user profile, we choose the Rocchio-based method for its simplicity, efficiency and its ability to be adaptive. Experimental results indicate that our technique to personalize web search is both effective and efficient.
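
A hedged sketch of a Rocchio-style user profile: accumulate a term vector per category from pages the user inspected, then map a new query to the categories with the highest cosine similarity. The class and method names are invented for the example, and the general (category-hierarchy) profile and the combination step are omitted.

from collections import defaultdict, Counter
import math

class RocchioProfile:
    """One accumulated term vector per category, built from the documents the
    user inspected for queries in that category."""

    def __init__(self):
        self.cats = defaultdict(Counter)

    def update(self, category, doc_text, weight=1.0):
        for term, tf in Counter(doc_text.lower().split()).items():
            self.cats[category][term] += weight * tf

    def map_query(self, query, top_k=3):
        q = Counter(query.lower().split())
        qn = math.sqrt(sum(v * v for v in q.values()))
        scores = {}
        for cat, vec in self.cats.items():
            vn = math.sqrt(sum(v * v for v in vec.values()))
            dot = sum(q[t] * vec[t] for t in q if t in vec)
            scores[cat] = dot / (qn * vn) if qn and vn else 0.0
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

profile = RocchioProfile()
profile.update("Programming/Python", "python tutorial generators iterators")
profile.update("Reptiles", "python snake habitat care")
print(profile.map_query("python generators"))   # Programming/Python ranked first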

Using micro information units for internet search
Xiaoli Li, Tong-Heng Phang, Minqing Hu, Bing Liu
Pages: 566-573
doi>10.1145/584792.584885
Internet search is one of the most important applications of the Web. A search engine takes the user's keywords to retrieve and to rank those pages that contain the keywords. One shortcoming of existing search techniques is that they do not give due consideration to the micro-structures of a Web page. A Web page is often populated with a number of small information units, which we call micro information units (MIUs). Each unit focuses on a specific topic and occupies a specific area of the page. During the search, if all the keywords in the user query occur in a single MIU of a page, the top ranking results returned by a search engine are generally relevant and useful. However, if the query words are scattered across different MIUs in a page, the pages returned can be quite irrelevant (which causes low precision). The reason for this is that although a page has information on individual MIUs, it may not have information on their intersections. In this paper, we propose a technique to solve this problem. At the off-line pre-processing stage, we segment each page to identify the MIUs in the page, and index the keywords of the page according to the MIUs in which they occur. In searching, our retrieval and ranking algorithm utilizes this additional information to return those most relevant pages. Experimental results show that this method is able to significantly improve the search precision.

Entropy-based link analysis for mining web informative structures
Hung-Yu Kao, Ming-Syan Chen, Shian-Hua Lin, Jan-Ming Ho
Pages: 574-581
doi>10.1145/584792.584886
In this paper, we study the problem of mining the informative structure of a news Web site which consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (or referred to as TOC, i.e., table of contents, pages) and a set of article pages linked by TOC pages through informative links. It is noted that the Hyperlink Induced Topic Search (HITS) algorithm has been employed to provide a solution to analyzing authorities and hubs of pages. However, most of the content sites tend to contain some extra hyperlinks, such as navigation panels, advertisements and banners, so as to increase the add-on values of their Web pages. Therefore, due to the structure induced by these extra hyperlinks, HITS is found to be insufficient to provide a good precision in solving the problem. To remedy this, we develop an algorithm to utilize entropy-based Link Analysis on Mining Web Informative Structures. This algorithm is referred to as LAMIS. The key idea of LAMIS is to utilize information entropy for representing the knowledge that corresponds to the amount of information in a link or a page in the link analysis. Experiments on several real news Web sites show that the precision and the recall of LAMIS are much superior to those obtained by heuristic methods and conventional link analysis methods.

SESSION: Clustering algorithms

COOLCAT: an entropy-based algorithm for categorical clustering
Daniel Barbará, Yi Li, Julia Couto
Pages: 582-589
doi>10.1145/584792.584888
In this paper we explore the connection between clustering categorical data and entropy: clusters of similar points have lower entropy than those of dissimilar ones. We use this connection to design an incremental heuristic algorithm, COOLCAT, which is capable of efficiently clustering large data sets of records with categorical attributes, and data streams. In contrast with other categorical clustering algorithms published in the past, COOLCAT's clustering results are very stable for different sample sizes and parameter settings. Also, the criterion for clustering is a very intuitive one, since it is deeply rooted in the well-known notion of entropy. Most importantly, COOLCAT is well equipped to deal with clustering of data streams (continuously arriving streams of data points) since it is an incremental algorithm capable of clustering new points without having to look at every point that has been clustered so far. We demonstrate the efficiency and scalability of COOLCAT by a series of experiments on real and synthetic data sets.
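
The placement criterion can be sketched as follows: put each incoming record into the cluster that keeps the size-weighted expected entropy of the clustering lowest. This omits COOLCAT's bootstrap step (choosing maximally dissimilar seed records) and its re-processing heuristics; it only illustrates the entropy-based incremental step.

import math
from collections import Counter

def cluster_entropy(records):
    """Sum of per-attribute entropies of one cluster of categorical tuples."""
    if not records:
        return 0.0
    n, width = len(records), len(records[0])
    ent = 0.0
    for j in range(width):
        counts = Counter(r[j] for r in records)
        ent -= sum((c / n) * math.log(c / n) for c in counts.values())
    return ent

def expected_entropy(clusters):
    """Size-weighted entropy over all clusters (the quantity being minimized)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters if c)

def coolcat_assign(clusters, record):
    """Place one incoming record in the cluster that keeps expected entropy lowest."""
    best, best_ent = None, None
    for i, c in enumerate(clusters):
        c.append(record)
        ent = expected_entropy(clusters)
        c.pop()
        if best_ent is None or ent < best_ent:
            best, best_ent = i, ent
    clusters[best].append(record)
    return best

clusters = [[("red", "round")], [("blue", "square")]]
print(coolcat_assign(clusters, ("red", "oval")))   # 0 -> joins the 'red' cluster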

FREM: fast and robust EM clustering for large data sets
Carlos Ordonez, Edward Omiecinski
Pages: 590-599
doi>10.1145/584792.584889
Clustering is a fundamental Data Mining technique. This article presents an improved EM algorithm to cluster large data sets having high dimensionality, noise and zero variance problems. The algorithm incorporates improvements to increase the quality of solutions and speed. In general the algorithm can find a good clustering solution in 3 scans over the data set. Alternatively, it can be run until it converges. The algorithm has a few parameters that are easy to set and have defaults for most cases. The proposed algorithm is compared against the standard EM algorithm and the On-Line EM algorithm.

Alternatives to the k-means algorithm that find better clusterings
Greg Hamerly, Charles Elkan
Pages: 600-607
doi>10.1145/584792.584890
We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k-harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic means. Our aim is to find which aspects of these algorithms contribute to finding good clusterings, as opposed to converging to a low-quality local optimum. We describe each algorithm in a unified framework that introduces separate cluster membership and data weight functions. We then show that the algorithms do behave very differently from each other on simple low-dimensional synthetic datasets and image segmentation tasks, and that the k-harmonic means method is superior. Having a soft membership function is essential for finding high-quality clusterings, but having a non-constant data weight function is useful also.
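
A small numpy sketch of a k-harmonic means update in the membership/weight framework described above: distances raised to -p-2 give soft memberships, a separate per-point weight boosts points far from all centers, and centers become weighted centroids. The parameter p, the epsilon guard and the random initialization are illustrative choices, not the paper's experimental settings.

import numpy as np

def khm(X, k, p=3.5, iters=100, eps=1e-8, seed=0):
    """k-harmonic means style updates: soft membership + per-point data weight."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps  # n x k
        m = d ** (-p - 2)
        m /= m.sum(axis=1, keepdims=True)                                      # membership
        w = (d ** (-p - 2)).sum(axis=1) / (d ** (-p)).sum(axis=1) ** 2          # data weight
        mw = m * w[:, None]
        centers = (mw.T @ X) / mw.sum(axis=0)[:, None]                          # weighted centroids
    return centers, m

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
centers, _ = khm(X, k=2)
print(np.round(centers, 2))   # two centers near (0, 0) and (3, 3)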

SESSION: Industry session 1: knowledge management and semantics

Thematic mapping - from unstructured documents to taxonomies
Christina Yip Chung, Raymond Lieu, Jinhui Liu, Alpha Luk, Jianchang Mao, Prabhakar Raghavan
Pages: 608-610
doi>10.1145/584792.584892
Verity Inc. has developed a comprehensive suite of tools for accurately and efficiently organizing enterprise content which involves four basic steps: (i) creating taxonomies, (ii) building classification models, (iii) populating taxonomies with documents, and (iv) deploying populated taxonomies in enterprise portals. A taxonomy is a hierarchical representation of categories. A taxonomy provides a navigation structure for exploring and understanding the underlying corpus without sifting through a huge volume of documents. Thematic Mapping automatically discovers a concept tree from a corpus of unstructured documents and assigns meaningful labels to concepts based on a semantic network. Integrating with Verity Intelligent Classifier's user-friendly GUI, a user can drill down a concept tree for navigation, perform a conceptual search to retrieve documents pertaining to a concept, build a taxonomy from the concept tree, as well as edit a taxonomy to tailor it into various views (customized taxonomies) of the same corpus. Classification rules can be generated automatically from concepts. These classification rules can be used for populating documents into the taxonomy.

Semantic technology applications for homeland security
D. Avant, M. Baum, C. Bertram, M. Fisher, A. Sheth, Y. Warke
Pages: 611-613
doi>10.1145/584792.584893

Rule-based data quality
David Loshin
Pages: 614-616
doi>10.1145/584792.584894
In the business intelligence/data warehouse user community, there is a growing confusion as to the difference between data cleansing and data quality. While many data cleansing products can help in applying data edits to name and address data, or help in transforming data during an ETL process, there is usually no persistence in this cleansing. This paper describes how we have implemented a business rules approach to build a data validation engine, called GuardianIQ, that transforms declarative data quality rules into code that objectively measures and reports levels of data quality based on user expectations.
|
|
|
SESSION: Industry session 2: data mining and federated systems |
|
|
|
|
Comparison of interestingness functions for learning web usage patterns |
| |
Xiangji Huang,
Nick Cercone,
Aijun An
|
|
Pages: 617-620 |
|
doi>10.1145/584792.584896 |
|
Full text: PDF
|
|
Livelink is a collaborative intranet, extranet and e-business application that enables employees and business partners of an organization to capture, share and reuse business information and knowledge. The usage of the Livelink software has been recorded ...
Livelink is a collaborative intranet, extranet and e-business application that enables employees and business partners of an organization to capture, share and reuse business information and knowledge. The usage of the Livelink software has been recorded by the Livelink Web server in its log files. We present an application of data mining techniques to the Livelink Web usage data. In particular, we focus on how to find interesting association rules and sequential patterns from the Livelink log files. A number of interestingness measures are used in our application to identify interesting rules and patterns. We present a comparison of these measures based on the feedback from domain experts. Some of the interestingness measures are found to be better than others. expand
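The abstract does not list the specific interestingness measures compared; as a generic illustration, the sketch below computes several standard measures (support, confidence, lift, conviction, leverage) for a single association rule A -> B from raw counts:

    import math

    def interestingness(n, n_a, n_b, n_ab):
        """Standard interestingness measures for a rule A -> B over n transactions,
        given the counts of A, B, and A-and-B. Which measures the paper actually
        compares is not stated in the abstract; these are common examples."""
        p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
        conf = p_ab / p_a
        return {
            "support": p_ab,
            "confidence": conf,
            "lift": conf / p_b,
            "conviction": (1 - p_b) / (1 - conf) if conf < 1 else math.inf,
            "leverage": p_ab - p_a * p_b,
        }

    # e.g. 10,000 sessions; page A in 800, page B in 1,200, both pages in 300
    print(interestingness(10_000, 800, 1_200, 300))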
|
|
|
The verity federated infrastructure |
| |
Kiam Choo,
Rajat Mukherjee,
Rami Smair,
Wei Zhang
|
|
Pages: 621-621 |
|
doi>10.1145/584792.584897 |
|
Full text: PDF
|
|
In the course of researching a subject, it is often necessary to submit the same search request to multiple heterogeneous information sources in order to (a) aggregate as much information as possible, and (b) integrate different aspects of the subject ...
In the course of researching a subject, it is often necessary to submit the same search request to multiple heterogeneous information sources in order to (a) aggregate as much information as possible, and (b) integrate different aspects of the subject into a coherent report. While it is clear that there is value in providing a federated search solution to make dealing with multiple sources less time-consuming, not all organizations aggregate from the same sources, and once the information has been retrieved, not all organizations want it to be integrated in the same way. The Verity Federated Infrastructure addresses this problem by providing a flexible framework for adding new sources and customizing the way in which results are integrated, post-processed and presented. A new source is made available by writing a Java module called a worker that abides by the search interface of the source. Sources can range from simple information feeds to more complex applications, e.g., CRM systems, relational databases, etc. Workers also perform post-processing on the results returned by other workers, e.g., to provide uniform scores for results from different sources, filtering, etc. This post-processing enables different results to be integrated into a coherent report. Post-processing is triggered by events that propagate between workers and is done asynchronously in the background while results are being viewed. This ability to do background post-processing allows execution of time-consuming operations that provide substantial value without adversely affecting user experience. Finally, search results are returned and viewed incrementally, which enables searching of peer-to-peer networks via peer workers that we have developed. expand
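The worker abstraction can be sketched as a small interface: each worker wraps one source, results are normalized per source, then merged into one ranked list. The real infrastructure uses Java workers and event-driven asynchronous post-processing; the class and function names below are illustrative stand-ins, not the Verity API:

    from abc import ABC, abstractmethod
    from concurrent.futures import ThreadPoolExecutor

    class Worker(ABC):
        """Stand-in for a federated-search worker: one wrapper per source."""
        @abstractmethod
        def search(self, query: str) -> list:
            ...

    class FeedWorker(Worker):
        """Toy worker over an in-memory document list."""
        def __init__(self, name, docs):
            self.name, self.docs = name, docs

        def search(self, query):
            q = query.lower()
            return [{"source": self.name, "title": d, "score": d.lower().count(q)}
                    for d in self.docs if q in d.lower()]

    def normalize(results):
        """Post-processing step: rescale one source's scores to [0, 1] so sources are comparable."""
        top = max((r["score"] for r in results), default=1) or 1
        return [dict(r, score=r["score"] / top) for r in results]

    def federated_search(query, workers):
        """Fan the query out to all workers concurrently, normalize each source's
        scores, then merge into one ranked list."""
        with ThreadPoolExecutor() as pool:
            batches = list(pool.map(lambda w: w.search(query), workers))
        merged = [r for batch in batches for r in normalize(batch)]
        return sorted(merged, key=lambda r: r["score"], reverse=True)

    workers = [FeedWorker("wiki", ["Federated search", "Peer-to-peer search networks"]),
               FeedWorker("crm", ["Search requests report", "Contact list"])]
    print(federated_search("search", workers))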
|
|
|
Automatically classifying database workloads |
| |
Said Elnaffar,
Pat Martin,
Randy Horman
|
|
Pages: 622-624 |
|
doi>10.1145/584792.584898 |
|
Full text: PDF
|
|
The type of the workload on a database management system (DBMS) is a key consideration in tuning the system. Allocations for resources such as main memory can be very different depending on whether the workload type is Online Transaction Processing (OLTP) ...
The type of the workload on a database management system (DBMS) is a key consideration in tuning the system. Allocations for resources such as main memory can be very different depending on whether the workload type is Online Transaction Processing (OLTP) or Decision Support System (DSS). In this paper, we present an approach to automatically identifying a DBMS workload as either OLTP or DSS. We build a classification model based on the most significant workload characteristics that differentiate OLTP from DSS, and then use the model to identify any change in the workload type. We construct a workload classifier from the Browsing and Ordering profiles of the TPC-W benchmark. Experiments with an industry-supplied workload show that our classifier accurately identifies the mix of OLTP and DSS work within an application workload. expand
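A workload classifier of this kind can be sketched as follows; the features (read/write ratio, rows touched per statement, fraction of aggregate queries) and the simple decision-stump learner are assumptions made for illustration, whereas the paper derives its classifier from TPC-W Browsing and Ordering profiles:

    # Each snapshot of DBMS activity is summarized by a few coarse features.
    def featurize(snapshot):
        return (snapshot["reads"] / max(snapshot["writes"], 1),
                snapshot["rows_per_stmt"],
                snapshot["agg_fraction"])

    def train_stump(samples):
        """Pick the single feature/threshold pair that best separates OLTP from DSS."""
        best = None
        for i in range(3):
            for t in sorted(s[0][i] for s in samples):
                acc = sum((s[0][i] > t) == (s[1] == "DSS") for s in samples) / len(samples)
                if best is None or acc > best[2]:
                    best = (i, t, acc)
        return best

    def classify(stump, features):
        i, t, _ = stump
        return "DSS" if features[i] > t else "OLTP"

    training = [  # (features, label) built from labelled benchmark runs
        (featurize({"reads": 900, "writes": 30, "rows_per_stmt": 5e4, "agg_fraction": 0.7}), "DSS"),
        (featurize({"reads": 500, "writes": 450, "rows_per_stmt": 12, "agg_fraction": 0.05}), "OLTP"),
        (featurize({"reads": 800, "writes": 60, "rows_per_stmt": 2e4, "agg_fraction": 0.5}), "DSS"),
        (featurize({"reads": 300, "writes": 280, "rows_per_stmt": 8, "agg_fraction": 0.02}), "OLTP"),
    ]
    stump = train_stump(training)
    print(classify(stump, featurize({"reads": 700, "writes": 50,
                                     "rows_per_stmt": 3e4, "agg_fraction": 0.6})))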
|
|
|
SESSION: Industry session 3: database performance and interface |
|
|
|
|
A mapping mechanism to support bitmap index and other auxiliary structures on tables stored as primary B+-trees |
| |
Eugene Inseok Chong,
Jagannathan Srinivasan,
Souripriya Das,
Chuck Freiwald,
Aravind Yalamanchi,
Mahesh Jagannath,
Anh-Tuan Tran,
Ramkumar Krishnan,
Richard Jiang
|
|
Pages: 625-628 |
|
doi>10.1145/584792.584900 |
|
Full text: PDF
|
|
Any auxiliary structure, such as a bitmap or a B+-tree index, that refers to rows of a table stored as a primary B+-tree (e.g., tables with clustered index in Microsoft SQL Server, or index-organized tables in Oracle) ...
Any auxiliary structure, such as a bitmap or a B+-tree index, that refers to rows of a table stored as a primary B+-tree (e.g., tables with a clustered index in Microsoft SQL Server, or index-organized tables in Oracle) by their physical addresses would require updates due to the inherent volatility of those addresses. To address this problem, we propose a mapping mechanism that 1) introduces a single mapping table, with each row holding one key value from the primary B+-tree, as an intermediate structure between the primary B+-tree and the associated auxiliary structures, and 2) augments the primary B+-tree structure to include in each row the physical address of the corresponding mapping table row. The mapping table row addresses can then be used in the auxiliary structures to indirectly refer to the primary B+-tree rows. The two key benefits are: 1) the mapping table shields the auxiliary structures from the volatility of the primary B+-tree row addresses, and 2) the method allows reuse of existing conventional table mechanisms for supporting auxiliary structures on primary B+-trees. The mapping mechanism is used for supporting bitmap indexes on index-organized tables in Oracle9i. The analytical and experimental studies show that the method is storage efficient, and (despite the mapping table overhead) provides performance benefits that are similar to those provided by bitmap indexes implemented on conventional tables. expand
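The two-level indirection can be illustrated with plain dictionaries standing in for the primary B+-tree, the mapping table and a bitmap index; here map_id stands in for the stable physical address of a mapping-table row, and the structure and names are a simplified sketch, not the Oracle9i implementation:

    primary = {}          # pk -> {"row": ..., "map_id": ...}   (the primary B+-tree)
    mapping = {}          # map_id -> pk                        (one mapping row per table row)
    bitmap_index = {}     # column value -> set of map_ids      (auxiliary structure)
    next_map_id = 0

    def insert(pk, row):
        global next_map_id
        map_id, next_map_id = next_map_id, next_map_id + 1
        mapping[map_id] = pk                 # mapping row holds the primary key value
        primary[pk] = {"row": row, "map_id": map_id}
        bitmap_index.setdefault(row["color"], set()).add(map_id)  # index never sees B+-tree addresses

    def lookup_by_color(color):
        # bitmap index -> mapping table -> primary B+-tree (two-level indirection);
        # reorganizing the B+-tree changes physical addresses but not keys or map_ids,
        # so the bitmap index stays valid.
        return [primary[mapping[m]]["row"] for m in bitmap_index.get(color, ())]

    insert(10, {"color": "red", "qty": 3})
    insert(20, {"color": "blue", "qty": 7})
    insert(30, {"color": "red", "qty": 1})
    print(lookup_by_color("red"))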
|
|
|
Using specification-driven concepts for distributed data management and dissemination |
| |
M. Brian Blake
|
|
Pages: 629-631 |
|
doi>10.1145/584792.584901 |
|
Full text: PDF
|
|
At the MITRE Corporation-Center for Advanced Aviation System Development (CAASD), software engineers work closely with both analysts and domain experts to develop software simulations in the air traffic management domain. In this environment, software ...
At the MITRE Corporation-Center for Advanced Aviation System Development (CAASD), software engineers work closely with both analysts and domain experts to develop software simulations in the air traffic management domain. In this environment, software simulations are applications that take large amounts of real-world operational information and, through calculations, derivations, and display, extend the original information to produce new insight into the domain. This new insight or knowledge typically comes in the form of a pertinent set of data. Based on this set of information, other research groups can further extend this knowledge. The challenge in this environment is to provide a distributed data management system that allows a distributed set of researchers to share their extended knowledge. This paper presents the motivation and design of such an architecture to support this collaborative knowledge/data sharing environment. This run-time configurable architecture is implemented using web-based technologies such as the Extensible Markup Language (XML), Java Servlets, the Extensible Stylesheet Language (XSL), and a relational database management system (RDBMS). expand
|
|
|
SESSION: Poster session |
|
|
|
|
A new cache replacement algorithm for the integration of web caching and prefetching |
| |
Cheng-Yue Chang,
Ming-Syan Chen
|
|
Pages: 632-634 |
|
doi>10.1145/584792.584903 |
|
Full text: PDF
|
|
Web caching and Web prefetching are two important techniques to reduce the noticeable response time perceived by users. Note that by integrating Web caching and Web prefetching, these two techniques can complement each other since the Web caching technique ...
Web caching and Web prefetching are two important techniques to reduce the noticeable response time perceived by users. Note that by integrating Web caching and Web prefetching, these two techniques can complement each other since the Web caching technique exploits the temporal locality whereas the Web prefetching technique exploits the spatial locality of Web objects. However, without careful design, the integration of these two techniques might cause significant performance degradation to each other. In view of this, we propose in this paper an innovative cache replacement algorithm, which not only considers the caching effect in the Web environment but also evaluates the prefetching rules provided by various prefetching schemes. Specifically, we formulate a normalized profit function to evaluate the profit from caching an object (i.e., either a non-implied object or an implied object according to some prefetching rule). Based on this normalized profit function, we devise a Web cache replacement algorithm, referred to as algorithm IWCP (standing for the Integration of Web Caching and Prefetching). Using an event-driven simulation, we evaluate the performance of algorithm IWCP under several circumstances. The experimental results show that algorithm IWCP consistently outperforms the companion schemes in various performance metrics. expand
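The abstract does not give the normalized profit function itself; the sketch below uses a common stand-in (hit probability times fetch cost divided by size), with an object implied by a prefetching rule inheriting that rule's confidence as its hit probability, to show how caching and prefetching information can be combined into one eviction order:

    def evict_order(cache, rules):
        """Order cached objects by an assumed normalized profit:
        hit probability * fetch cost / size. Objects implied by a prefetching
        rule inherit the rule's confidence as their hit probability."""
        def profit(url):
            obj = cache[url]
            p_hit = max(obj["recency_score"], rules.get(url, 0.0))
            return p_hit * obj["cost"] / obj["size"]
        return sorted(cache, key=profit)   # least profitable objects are evicted first

    cache = {
        "/index.html": {"recency_score": 0.9, "cost": 40, "size": 8},
        "/big.pdf":    {"recency_score": 0.1, "cost": 300, "size": 900},
        "/next.html":  {"recency_score": 0.2, "cost": 35, "size": 6},
    }
    rules = {"/next.html": 0.7}   # prefetching rule: /index.html -> /next.html, confidence 0.7
    print(evict_order(cache, rules))   # '/big.pdf' comes first, i.e. is evicted first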
|
|
|
A syntactic approach for searching similarities within sentences |
| |
Federica Mandreoli,
Riccardo Martoglia,
Paolo Tiberio
|
|
Pages: 635-637 |
|
doi>10.1145/584792.584904 |
|
Full text: PDF
|
|
Textual data is the main electronic form of knowledge representation. Sentences, meant as logic units of meaningful word sequences, can be considered its backbone. In this paper, we propose a solution based on a purely syntactic approach for searching ...
Textual data is the main electronic form of knowledge representation. Sentences, meant as logic units of meaningful word sequences, can be considered its backbone. In this paper, we propose a solution based on a purely syntactic approach for searching similarities within sentences, named approximate sub-sequence matching. Since this process is very time consuming, efficiency in retrieving the most similar parts available in large repositories of textual data is ensured by new filtering techniques. As far as the design of the system is concerned, we chose a solution that allows us to deploy approximate sub-sequence matching without changing the underlying database. expand
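The filter-then-verify idea behind such approaches can be sketched as follows: a cheap term-overlap filter prunes candidate windows before a word-level edit distance is computed. The paper's own filtering techniques are more elaborate; this is only an illustrative sketch:

    def edit_distance(a, b):
        """Word-level Levenshtein distance between two token sequences."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
            prev = cur
        return prev[-1]

    def best_subsequence_match(query, sentence, max_dist=2):
        """Slide a window the length of the query over the sentence; a cheap
        term-overlap filter prunes windows before the edit distance (verify step)."""
        q, s = query.lower().split(), sentence.lower().split()
        best = None
        for start in range(len(s) - len(q) + 1):
            window = s[start:start + len(q)]
            if len(set(q) & set(window)) < len(q) - max_dist:   # filter step
                continue
            d = edit_distance(q, window)                        # verify step
            if d <= max_dist and (best is None or d < best[0]):
                best = (d, " ".join(window))
        return best

    print(best_subsequence_match("the cat sat on the mat",
                                 "yesterday the cat quietly sat on the mat near the door"))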
|
|
|
A system for knowledge management in bioinformatics |
| |
Sudeshna Adak,
Vishal S. Batra,
Deo N. Bhardwaj,
P. V. Kamesam,
Pankaj Kankar,
Manish P. Kurhekar,
Biplav Srivastava
|
|
Pages: 638-641 |
|
doi>10.1145/584792.584905 |
|
Full text: PDF
|
|
The emerging biochip technology has made it possible to simultaneously study expression (activity level) of thousands of genes or proteins in a single experiment in the laboratory. However, in order to extract relevant biological knowledge from the biochip ...
The emerging biochip technology has made it possible to simultaneously study expression (activity level) of thousands of genes or proteins in a single experiment in the laboratory. However, in order to extract relevant biological knowledge from the biochip experimental data, it is critical not only to analyze the experimental data, but also to cross-reference and correlate these large volumes of data with information available in external biological databases accessible online. We address this problem in a comprehensive system for knowledge management in bioinformatics called e2e. To the biologist or biological applications, e2e exposes a common semantic view of the inter-relationships among biological concepts in the form of an XML representation called eXpressML, while internally, it can use any data integration solution to retrieve data and return results corresponding to the semantic view. We have implemented an e2e prototype that enables a biologist to analyze her gene expression data in GEML format or from a public site such as Stanford's, and discover knowledge through operations like querying on relevant annotated data represented in eXpressML, using pathways data from KEGG, publication data from Medline and protein data from SWISS-PROT. expand
|
|
|
An agent-based approach to knowledge management |
| |
Bin Yu,
Munindar P. Singh
|
|
Pages: 642-644 |
|
doi>10.1145/584792.584906 |
|
Full text: PDF
|
|
Traditional approaches to knowledge management are essentially limited to document management. However, much knowledge in organizations or communities resides in an informal social network and may be accessed only by asking the right people. This paper ...
Traditional approaches to knowledge management are essentially limited to document management. However, much knowledge in organizations or communities resides in an informal social network and may be accessed only by asking the right people. This paper describes MARS, a multiagent referral system for knowledge management. MARS assigns a software agent to each user. The agents facilitate their users' interactions and help manage their personal social networks. Moreover, the agents cooperate with one another by giving and taking referrals to help their users find the right parties to contact for a specific knowledge need. expand
|
|
|
Features of documents relevant to task- and fact- oriented questions |
| |
Diane Kelly,
Xiao-jun Yuan,
Nicholas J. Belkin,
Vanessa Murdock,
W. Bruce Croft
|
|
Pages: 645-647 |
|
doi>10.1145/584792.584907 |
|
Full text: PDF
|
|
We describe results from an ongoing project that considers question types and document features and their relationship to retrieval techniques. We examine eight document features from the top 25 documents retrieved from 74 questions and find that lists ...
We describe results from an ongoing project that considers question types and document features and their relationship to retrieval techniques. We examine eight document features from the top 25 documents retrieved from 74 questions and find that lists and FAQs occur in more documents judged relevant to task-oriented questions than those judged relevant to fact-oriented questions. expand
|
|
|
Data fusion with estimated weights |
| |
Shengli Wu,
Fabio Crestani
|
|
Pages: 648-651 |
|
doi>10.1145/584792.584908 |
|
Full text: PDF
|
|
This paper proposes an adaptive approach for data fusion of information retrieval systems, which exploits estimated performances of all component input systems without relevance judgement or training. The estimation is conducted prior to the fusion but ...
This paper proposes an adaptive approach for data fusion of information retrieval systems, which exploits estimated performances of all component input systems without relevance judgement or training. The estimation is conducted prior to the fusion but uses the same data to which the fusion is applied. The experiment shows that our algorithms are competitive with, and often outperform, CombMNZ, one of the most effective algorithms in use. expand
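CombMNZ sums normalized scores and multiplies by the number of systems that retrieved the document; a weighted variant, where each system's scores are scaled by an estimated performance weight, conveys the general idea of fusion with estimated weights. The estimation procedure itself is not reproduced here, and the weights in the example are placeholders:

    def normalize(run):
        """Min-max normalize one system's scores to [0, 1]."""
        lo, hi = min(run.values()), max(run.values())
        return {d: (s - lo) / (hi - lo) if hi > lo else 1.0 for d, s in run.items()}

    def weighted_combmnz(runs, weights):
        """runs: list of {doc: score}; weights: estimated performance per system.
        The summed weighted scores are multiplied by the number of systems
        that retrieved the document (the MNZ factor)."""
        fused = {}
        for run, w in zip(runs, weights):
            for doc, score in normalize(run).items():
                total, hits = fused.get(doc, (0.0, 0))
                fused[doc] = (total + w * score, hits + 1)
        return sorted(((total * hits, doc) for doc, (total, hits) in fused.items()),
                      reverse=True)

    run_a = {"d1": 12.0, "d2": 7.5, "d3": 3.0}
    run_b = {"d2": 0.9, "d4": 0.6, "d1": 0.2}
    print(weighted_combmnz([run_a, run_b], weights=[0.7, 0.3]))  # weights would be estimated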
|
|
|
Discovering the representative of a search engine |
| |
King-Lup Liu,
Clement Yu,
Weiyi Meng
|
|
Pages: 652-654 |
|
doi>10.1145/584792.584909 |
|
Full text: PDF
|
|
Given a large number of search engines on the Internet, it is difficult for a person to determine which search engines could serve his/her information needs. A common solution is to construct a metasearch engine on top of the search engines. Upon receiving ...
Given a large number of search engines on the Internet, it is difficult for a person to determine which search engines could serve his/her information needs. A common solution is to construct a metasearch engine on top of the search engines. Upon receiving a user query, the metasearch engine sends it to those underlying search engines which are likely to return the desired documents for the query. The selection algorithm used by a metasearch engine to determine whether a search engine should be sent the query typically makes the decision based on the search-engine representative, which contains characteristic information about the database of a search engine. However, an underlying search engine may not be willing to provide the needed information to the metasearch engine. This paper shows that the needed information can be estimated from an uncooperative search engine with good accuracy. Two pieces of information which permit accurate search engine selection are the number of documents indexed by the search engine and the maximum weight of each term. In this paper, we present techniques for the estimation of these two pieces of information. expand
|
|
|
Ginga: a self-adaptive query processing system |
| |
Henrique Paques,
Ling Liu,
Calton Pu
|
|
Pages: 655-658 |
|
doi>10.1145/584792.584910 |
|
Full text: PDF
|
|
|
|
|
High-performing feature selection for text classification |
| |
Monica Rogati,
Yiming Yang
|
|
Pages: 659-661 |
|
doi>10.1145/584792.584911 |
|
Full text: PDF
|
|
This paper reports a controlled study on a large number of filter feature selection methods for text classification. Over 100 variants of five major feature selection criteria were examined using four well-known classification algorithms: a Naive Bayesian ...
This paper reports a controlled study on a large number of filter feature selection methods for text classification. Over 100 variants of five major feature selection criteria were examined using four well-known classification algorithms: a Naive Bayesian (NB) approach, a Rocchio-style classifier, a k-nearest neighbor (kNN) method and a Support Vector Machine (SVM) system. Two benchmark collections were chosen as the testbeds: Reuters-21578 and a small portion of Reuters Corpus Version 1 (RCV1), making the new results comparable to published results. We found that feature selection methods based on χ² statistics consistently outperformed those based on other criteria (including information gain) for all four classifiers and both data collections, and that a further increase in performance was obtained by combining uncorrelated and high-performing feature selection methods. The results we obtained using only 3% of the available features are among the best reported, including results obtained with the full feature set. expand
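As a reference point, χ² feature selection ranks each term by the χ² statistic of its 2x2 term/class contingency table; the sketch below shows the standard computation (it is not the paper's experimental pipeline, and the toy data is invented):

    def chi_square(n, n_t, n_c, n_tc):
        """chi-square statistic for a term/class pair from a 2x2 contingency table:
        n documents, n_t containing the term, n_c in the class, n_tc both."""
        a, b = n_tc, n_t - n_tc                    # term present: in class / not in class
        c, d = n_c - n_tc, n - n_t - n_c + n_tc    # term absent:  in class / not in class
        num = n * (a * d - b * c) ** 2
        den = (a + c) * (b + d) * (a + b) * (c + d)
        return num / den if den else 0.0

    def select_features(docs, labels, k=2):
        """Rank vocabulary terms by their maximum chi-square over classes; keep top k."""
        vocab = {t for doc in docs for t in doc}
        n, classes = len(docs), set(labels)
        scores = {}
        for t in vocab:
            n_t = sum(t in doc for doc in docs)
            scores[t] = max(
                chi_square(n, n_t, labels.count(c),
                           sum(t in doc for doc, l in zip(docs, labels) if l == c))
                for c in classes)
        return sorted(scores, key=scores.get, reverse=True)[:k]

    docs = [{"goal", "match"}, {"match", "referee"}, {"stock", "market"}, {"market", "bond"}]
    labels = ["sport", "sport", "finance", "finance"]
    print(select_features(docs, labels))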
|
|
|
Index compression vs. retrieval time of inverted files for XML documents |
| |
Norbert Fuhr,
Norbert Gövert
|
|
Pages: 662-664 |
|
doi>10.1145/584792.584912 |
|
Full text: PDF
|
|
Query languages for retrieval of XML documents allow for conditions referring both to the content and the structure of documents. In this paper, we investigate two different approaches for reducing index space of inverted files for XML documents. First, ...
Query languages for retrieval of XML documents allow for conditions referring both to the content and the structure of documents. In this paper, we investigate two different approaches for reducing index space of inverted files for XML documents. First, we consider methods for compressing index entries. Second, we develop the new XS tree data structure which contains the structural description of a document in a rather compact form, such that these descriptions can be kept in main memory. Experimental results on two large XML document collections show that very high compression rates for indexes can be achieved, but any compression increases retrieval time. On the other hand, highly compressed indexes may be feasible for applications where storage is limited, such as in PDAs or E-book devices. expand
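One common way to compress such index entries is to delta-encode each posting list and variable-byte code the gaps, which shrinks the index at the price of decode work at retrieval time; the sketch below illustrates that generic technique, not necessarily the paper's specific scheme:

    def vbyte_encode(postings):
        """Delta-encode a sorted posting list, then variable-byte code the gaps
        (7 payload bits per byte; the high bit marks the last byte of a gap)."""
        out, prev = bytearray(), 0
        for doc_id in postings:
            gap, prev = doc_id - prev, doc_id
            chunks = [gap & 0x7F]
            gap >>= 7
            while gap:
                chunks.append(gap & 0x7F)
                gap >>= 7
            chunks[0] |= 0x80                 # flag the final (least-significant) byte
            out.extend(reversed(chunks))      # emit most-significant byte first
        return bytes(out)

    def vbyte_decode(data):
        postings, value, prev = [], 0, 0
        for b in data:
            value = (value << 7) | (b & 0x7F)
            if b & 0x80:                      # last byte of this gap
                prev += value
                postings.append(prev)
                value = 0
        return postings

    plist = [3, 7, 21, 150, 151, 4000]
    enc = vbyte_encode(plist)
    assert vbyte_decode(enc) == plist
    print(len(enc), "bytes compressed vs", 4 * len(plist), "bytes uncompressed")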
|
|
|
Interactive methods for taxonomy editing and validation |
| |
Scott Spangler,
Jeffrey Kreulen
|
|
Pages: 665-668 |
|
doi>10.1145/584792.584913 |
|
Full text: PDF
|
|
Taxonomies are meaningful hierarchical categorizations of documents into topics reflecting the natural relationships between the documents and their business objectives. Improving the quality of these taxonomies and reducing the overall cost required ...
Taxonomies are meaningful hierarchical categorizations of documents into topics reflecting the natural relationships between the documents and their business objectives. Improving the quality of these taxonomies and reducing the overall cost required to create them is an important area of research. Supervised and unsupervised text clustering are important technologies that comprise only a part of a complete solution. However, there is a great need for humans to be able to interact efficiently with a taxonomy during the editing and validation phase. We have developed a comprehensive approach to solving this problem, and implemented this approach in a software tool called eClassifier. eClassifier provides features to help the taxonomy editor understand and evaluate each category of a taxonomy and visualize the relationships between the categories. Multiple techniques allow the user to make changes at both the category and document level. Metrics then establish how well the resultant taxonomy can be modeled for future document classification. In this paper, we present a comprehensive set of viewing, editing and validation techniques we have implemented in the Lotus Discovery Server, resulting in a significant reduction in the time required to create a quality taxonomy. expand
|
|
|
Knowledge discovery from texts: a concept frame graph approach |
| |
Kanagasabai Rajaraman,
Ah-Hwee Tan
|
|
Pages: 669-671 |
|
doi>10.1145/584792.584914 |
|
Full text: PDF
|
|
We address the text content mining problem through a concept-based framework by constructing a conceptual knowledge base and discovering knowledge therefrom. Defining a novel representation called the Concept Frame Graph (CFG), we propose a learning ...
We address the text content mining problem through a concept-based framework by constructing a conceptual knowledge base and discovering knowledge therefrom. Defining a novel representation called the Concept Frame Graph (CFG), we propose a learning algorithm for constructing a CFG knowledge base from text documents. An interactive concept map visualization technique is presented for user-guided knowledge discovery from the knowledge base. Through experimental studies on real-life documents, we observe that the proposed approach is promising for mining deeper knowledge. expand
|
|
|
Knowledge discovery in patent databases |
| |
Konstantinos Markellos,
Katerina Perdikuri,
Penelope Markellou,
Spiros Sirmakessis,
George Mayritsakis,
Athanasios Tsakalidis
|
|
Pages: 672-674 |
|
doi>10.1145/584792.584915 |
|
Full text: PDF
|
|
Business, scientific, and personal databases are growing at an exponential rate. However, what is truly valuable is the knowledge that can be extracted from the stored data. Knowledge Discovery in patent databases was traditionally based ...
Business, scientific, and personal databases are growing at an exponential rate. However, what is truly valuable is the knowledge that can be extracted from the stored data. Knowledge Discovery in patent databases was traditionally based on manual analysis carried out by statistical experts. Nowadays the increasing interest of many actors has led to the development of new tools for discovering and exploiting information related to technological activities and innovation, "hidden" in patent databases. In this paper we present a system that combines efficient and innovative methodologies and tools for the analysis of patent data stored in international databases and the production of scientific and technological indicators. expand
|
|
|
Web-DL: an experience in building digital libraries from the web |
| |
Pável P. Calado,
Altigran S. da Silva,
Berthier Ribeiro-Neto,
Alberto H. F. Laender,
Juliano P. Lage,
Davi C. Reis,
Pablo A. Roberto,
Monique V. Vieira,
Marcos A. Gonçalves,
Edward A. Fox
|
|
Pages: 675-677 |
|
doi>10.1145/584792.584916 |
|
Full text: PDF
|
|
The Web contains a huge volume of information, almost all unstructured and, therefore, difficult to manage. In Digital Libraries, however, information is explicitly organized, described, and managed. In this paper, we propose an architecture that allows ...
The Web contains a huge volume of information, almost all unstructured and, therefore, difficult to manage. In Digital Libraries, however, information is explicitly organized, described, and managed. In this paper, we propose an architecture that allows the construction of digital libraries from the Web, using standard protocols and archival technologies, and incorporating powerful digital library and data extraction tools, thus benefiting from the breadth of the Web's contents while supporting the services and organization available in digital libraries. The proposed architecture was applied to the Networked Digital Library of Theses and Dissertations, providing an important first step toward rapid construction of large DLs from the Web, as well as a large-scale solution for interoperability between independent digital libraries. expand
|
|
|
Mining coverage statistics for websource selection in a mediator |
| |
Zaiqing Nie,
Ullas Nambiar,
Sreelakshmi Vaddi,
Subbarao Kambhampati
|
|
Pages: 678-680 |
|
doi>10.1145/584792.584917 |
|
Full text: PDF
|
|
Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition there are no effective approaches for learning the needed statistics. ...
Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping the number of needed statistics low enough to have the storage and learning costs manageable. Naive approaches can become infeasible very quickly. In this paper we present a set of connected techniques that estimate the coverage and overlap statistics while keeping the needed statistics tightly under control. Our approach uses a hierarchical classification of the queries, and threshold based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method, and present experimental results demonstrating the efficiency of the learning algorithms and the effectiveness of the learned statistics. expand
|
|
|
Mining soft-matching association rules |
| |
Un Yong Nahm,
Raymond J. Mooney
|
|
Pages: 681-683 |
|
doi>10.1145/584792.584918 |
|
Full text: PDF
|
|
Variation and noise in database entries can prevent data mining algorithms, such as association rule mining, from discovering important regularities. In particular, textual fields can exhibit variation due to typographical errors, misspellings, abbreviations, ...
Variation and noise in database entries can prevent data mining algorithms, such as association rule mining, from discovering important regularities. In particular, textual fields can exhibit variation due to typographical errors, misspellings, abbreviations, etc. By allowing partial or "soft matching" of items based on a similarity metric such as edit-distance or cosine similarity, additional important patterns can be detected. This paper introduces an algorithm, SoftApriori, that discovers soft-matching association rules given a user-supplied similarity metric for each field. Experimental results on several "noisy" datasets extracted from text demonstrate that SoftApriori discovers additional relationships that more accurately reflect regularities in the data. expand
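The soft-match step can be illustrated as follows: two textual items are treated as the same item when their string similarity exceeds a threshold, and support is counted over soft matches. The similarity function here (difflib's ratio) and the threshold are stand-ins for the user-supplied metric; this is not the SoftApriori algorithm itself:

    from difflib import SequenceMatcher

    def soft_equal(a, b, threshold=0.75):
        """Two textual items 'soft-match' if their string similarity exceeds a
        threshold; the paper allows any user-supplied metric (edit distance,
        cosine, ...), here difflib's ratio is used as a stand-in."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    def soft_support(itemset, transactions):
        """Fraction of transactions containing a soft match for every item in the set."""
        return sum(all(any(soft_equal(item, t_item) for t_item in txn)
                       for item in itemset)
                   for txn in transactions) / len(transactions)

    transactions = [
        {"Microsoft Corp.", "Seattle"},
        {"Microsoft Corporation", "Redmond"},
        {"Micros0ft Corp", "Seattle WA"},
        {"Oracle Corp.", "Redwood City"},
    ]
    # Exact matching sees three distinct company strings; soft matching treats
    # them as one frequent item.
    print(soft_support({"Microsoft Corp."}, transactions))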
|
|
|
Parallelizing the buckshot algorithm for efficient document clustering |
| |
Eric C. Jensen,
Steven M. Beitzel,
Angelo J. Pilotto,
Nazli Goharian,
Ophir Frieder
|
|
Pages: 684-686 |
|
doi>10.1145/584792.584919 |
|
Full text: PDF
|
|
We present a parallel implementation of the Buckshot document clustering algorithm. We demonstrate that this parallel approach is highly efficient both in terms of load balancing and minimization of communication. In a series of experiments using the ...
We present a parallel implementation of the Buckshot document clustering algorithm. We demonstrate that this parallel approach is highly efficient both in terms of load balancing and minimization of communication. In a series of experiments using the 2GB of SGML data from TREC disks 4 and 5, our parallel approach was shown to be scalable in terms of processors efficiently used and the number of clusters created. expand
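For reference, the sequential Buckshot algorithm clusters a random sample of about sqrt(k*n) documents with an expensive method and then assigns every document to the nearest resulting centroid; the sketch below shows that baseline on toy 2-D points, while the parallelization studied in the paper is not reproduced here:

    import math, random

    def buckshot(docs, k, dist):
        """Buckshot seeding: cluster a sample of size ~sqrt(k*n) with a naive
        agglomerative method, then assign every document to the nearest centroid."""
        sample = random.sample(docs, min(len(docs), int(math.sqrt(k * len(docs)))))
        clusters = [[d] for d in sample]
        while len(clusters) > k:                       # merge the two closest clusters
            i, j = min(((i, j) for i in range(len(clusters))
                               for j in range(i + 1, len(clusters))),
                       key=lambda ij: dist(centroid(clusters[ij[0]]),
                                           centroid(clusters[ij[1]])))
            clusters[i] += clusters.pop(j)
        centers = [centroid(c) for c in clusters]
        groups = [[] for _ in centers]
        for d in docs:                                 # cheap assignment pass over all docs
            groups[min(range(k), key=lambda c: dist(d, centers[c]))].append(d)
        return groups

    def centroid(points):
        return tuple(sum(x) / len(points) for x in zip(*points))

    def euclidean(a, b):
        return math.dist(a, b)

    random.seed(1)
    docs = [(random.gauss(cx, 0.3), random.gauss(cy, 0.3))
            for cx, cy in [(0, 0), (4, 4), (0, 4)] for _ in range(40)]
    print([len(g) for g in buckshot(docs, 3, euclidean)])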
|