Mining quantitative association rules in large relational tables
Ramakrishnan Srikant, Rakesh Agrawal
Pages: 1-12
DOI: 10.1145/233269.233311

We introduce the problem of mining association rules in large relational tables containing both quantitative and categorical attributes. An example of such an association might be "10% of married people between age 50 and 60 have at least 2 cars". We deal with quantitative attributes by fine-partitioning the values of the attribute and then combining adjacent partitions as necessary. We introduce measures of partial completeness which quantify the information lost due to partitioning. A direct application of this technique can generate too many similar rules. We tackle this problem by using a "greater-than-expected-value" interest measure to identify the interesting rules in the output. We give an algorithm for mining such quantitative association rules. Finally, we describe the results of using this approach on a real-life dataset.
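
As a rough illustration of the fine-partition-then-merge step described above, the sketch below equi-depth partitions a numeric attribute and merges adjacent intervals that fall short of a minimum support. The interval boundaries, support threshold, and helper names are illustrative assumptions, not the paper's implementation.

```python
# Sketch of fine-partitioning a quantitative attribute and then combining
# adjacent partitions that lack support. Boundaries, the support threshold and
# helper names are illustrative assumptions, not the paper's implementation.

def equi_depth_cuts(values, num_parts):
    """Cut points that split the sorted values into roughly equal-sized intervals."""
    values = sorted(values)
    size = max(1, len(values) // num_parts)
    return [values[i] for i in range(size, len(values), size)]

def merge_adjacent(values, cuts, min_support):
    """Merge an interval into its left neighbour while the neighbour is under-supported."""
    bounds = [float("-inf")] + cuts + [float("inf")]
    intervals = []
    for lo, hi in zip(bounds, bounds[1:]):
        count = sum(lo < v <= hi for v in values)
        if intervals and intervals[-1][2] < min_support:
            plo, _, pcount = intervals[-1]
            intervals[-1] = (plo, hi, pcount + count)
        else:
            intervals.append((lo, hi, count))
    if len(intervals) > 1 and intervals[-1][2] < min_support:   # fix up the tail
        (plo, _, pcount), (_, hi, count) = intervals[-2], intervals[-1]
        intervals[-2:] = [(plo, hi, pcount + count)]
    return intervals

ages = [23, 25, 31, 34, 35, 41, 44, 52, 55, 58, 61, 67]
print(merge_adjacent(ages, equi_depth_cuts(ages, 6), min_support=3))
```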

Data mining using two-dimensional optimized association rules: scheme, algorithms, and visualization
Takeshi Fukuda, Yasuhiko Morimoto, Shinichi Morishita, Takeshi Tokuyama
Pages: 13-23
DOI: 10.1145/233269.233313

We discuss data mining based on association rules for two numeric attributes and one Boolean attribute. For example, in a database of bank customers, "Age" and "Balance" are two numeric attributes, and "CardLoan" is a Boolean attribute. Taking the pair (Age, Balance) as a point in two-dimensional space, we consider an association rule of the form ((Age, Balance) ∈ P) ⇒ (CardLoan = Yes), which implies that bank customers whose ages and balances fall in a planar region P tend to use card loans with high probability. We consider two classes of regions, rectangles and admissible (i.e. connected and x-monotone) regions. For each class, we propose efficient algorithms for computing the regions that give optimal association rules for gain, support, and confidence, respectively. We have implemented the algorithms for admissible regions, and constructed a system for visualizing the rules.

IDEA: interactive data exploration and analysis
Peter G. Selfridge, Divesh Srivastava, Lynn O. Wilson
Pages: 24-34
DOI: 10.1145/233269.233315

The analysis of business data is often an ill-defined task characterized by large amounts of noisy data. Because of this, business data analysis must combine two kinds of intertwined tasks: exploration and analysis. Exploration is the process of finding the appropriate subset of data to analyze, and analysis is the process of measuring the data to provide the business answer. While there are many tools available both for exploration and for analysis, a single tool or set of tools may not provide full support for these intertwined tasks. We report here on a project that set out to understand a specific business data analysis problem and build an environment to support it. The results of this understanding are, first of all, a detailed list of requirements of this task; second, a set of capabilities that meet these requirements; and third, an implemented client-server solution that addresses many of these requirements and identifies others for future work. Our solution incorporates several novel perspectives on data analysis and combines a history mechanism with a graphical, re-usable representation of the analysis and exploration process. Our approach emphasizes using the database itself to represent as many of these functions as possible.

Rapid bushy join-order optimization with Cartesian products
Bennet Vance, David Maier
Pages: 35-46
DOI: 10.1145/233269.233317

Query optimizers often limit the search space for join orderings, for example by excluding Cartesian products in subplans or by restricting plan trees to left-deep vines. Such exclusions are widely assumed to reduce optimization effort while minimally affecting plan quality. However, we show that searching the complete space of plans is more affordable than has been previously recognized, and that the common exclusions may be of little benefit. We start by presenting a Cartesian product optimizer that requires at most a few seconds of workstation time to search the space of bushy plans for products of up to 15 relations. Building on this result, we present a join-order optimizer that achieves a similar level of performance, and retains the ability to include Cartesian products in subplans wherever appropriate. The main contribution of the paper is in fully separating join-order enumeration from predicate analysis, and in showing that the former problem in particular can be solved swiftly by novel implementation techniques. A secondary contribution is to initiate a systematic approach to the benchmarking of join-order optimization, which we apply to the evaluation of our method.

SQL query optimization: reordering for a general class of queries
Piyush Goel, Bala Iyer
Pages: 47-56
DOI: 10.1145/233269.233318

The strength of commercial query optimizers like DB2 comes from their ability to select an optimal order by generating all equivalent reorderings of binary operators. However, there are no known methods to generate all equivalent reorderings for a SQL query containing joins, outer joins, and groupby aggregations. Consequently, some of the reorderings with significantly lower cost may be missed. Using a hypergraph model and a set of novel identities, we propose a method to reorder a SQL query containing joins, outer joins, and groupby aggregations. While these operators are sufficient to capture the SQL semantics, it is during their reordering that we identify a powerful primitive needed for a DBMS. We report our findings of a simple, yet fundamental operator, generalized selection, and demonstrate its power to solve the problem of reordering of SQL queries containing joins, outer joins, and groupby aggregations.

Fundamental techniques for order optimization
David Simmen, Eugene Shekita, Timothy Malkemus
Pages: 57-67
DOI: 10.1145/233269.233320

Decision support applications are growing in popularity as more business data is kept on-line. Such applications typically include complex SQL queries that can test a query optimizer's ability to produce an efficient access plan. Many access plan strategies exploit the physical ordering of data provided by indexes or sorting. Sorting is an expensive operation, however. Therefore, it is imperative that sorting is optimized in some way or avoided altogether. Toward that goal, this paper describes novel optimization techniques for pushing down sorts in joins, minimizing the number of sorting columns, and detecting when sorting can be avoided because of predicates, keys, or indexes. A set of fundamental operations is described that provide the foundation for implementing such techniques. The operations exploit data properties that arise from predicate application, uniqueness, and functional dependencies. These operations and techniques have been implemented in IBM's DB2/CS.
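
One of the techniques named above, minimizing the number of sorting columns, can be illustrated with a small sketch: columns bound to a constant by a predicate add no ordering, and once the retained columns cover a key there is nothing left to sort on. The helper below is a hypothetical simplification, not DB2/CS code.

```python
# Toy sketch of "minimizing the number of sorting columns": columns bound to a
# constant by a predicate are dropped, and once the columns kept so far cover a
# key, every later ORDER BY column is redundant. Hypothetical helper, not DB2's.

def reduce_order_by(order_cols, constant_cols, keys):
    kept = []
    for col in order_cols:
        if col in constant_cols:
            continue                      # equal to a constant: contributes no ordering
        kept.append(col)
        if any(set(key) <= set(kept) | constant_cols for key in keys):
            break                         # a key is covered: rows are already unique
    return kept

# ORDER BY dept, id, name with predicate dept = 10 and key {id}
print(reduce_order_by(["dept", "id", "name"], {"dept"}, [["id"]]))  # -> ['id']
```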

A Teradata content-based multimedia object manager for massively parallel architectures
W. O'Connell, I. T. Ieong, D. Schrader, C. Watson, G. Au, A. Biliris, S. Choo, P. Colin, G. Linderman, E. Panagos, J. Wang, T. Walter
Pages: 68-78
DOI: 10.1145/233269.233321

The Teradata Multimedia Object Manager is a general-purpose content analysis multimedia server designed for symmetric multiprocessing and massively parallel processing environments. The Multimedia Object Manager defines and manipulates user-defined functions (UDFs), which are invoked in parallel to analyze or manipulate the contents of multimedia objects. Several computationally intensive applications of this technology, which use large persistent datasets, include fingerprint matching, signature verification, face recognition, and speech recognition/translation.

Fault-tolerant architectures for continuous media servers
Banu Özden, Rajeev Rastogi, Prashant Shenoy, Avi Silberschatz
Pages: 79-90
DOI: 10.1145/233269.233322

Continuous media servers that provide support for the storage and retrieval of continuous media data (e.g., video, audio) at guaranteed rates are becoming increasingly important. Such servers typically rely on several disks to service a large number of clients, and are thus highly susceptible to disk failures. We have developed two fault-tolerant approaches that rely on admission control in order to meet rate guarantees for continuous media requests. The schemes enable data to be retrieved from disks at the required rate even if a certain disk were to fail. For both approaches, we present data placement strategies and admission control algorithms. We also present design techniques for maximizing the number of clients that can be supported by a continuous media server. Finally, through extensive simulations, we demonstrate the effectiveness of our schemes.

Optimizing queries over multimedia repositories
Surajit Chaudhuri, Luis Gravano
Pages: 91-102
DOI: 10.1145/233269.233323

Repositories of multimedia objects having multiple types of attributes (e.g., image, text) are becoming increasingly common. A selection on these attributes will typically produce not just a set of objects, as in the traditional relational query model (filtering), but also a grade of match associated with each object, indicating how well the object matches the selection condition (ranking). Also, multimedia repositories may allow access to the attributes of each object only through indexes. We investigate how to optimize the processing of queries over multimedia repositories. A key issue is the choice of the indexes used to search the repository. We define an execution space that is search-minimal, i.e., the set of indexes searched is minimal. Although the general problem of picking an optimal plan in the search-minimal execution space is NP-hard, we solve the problem efficiently when the predicates in the query are independent. We also show that the problem of optimizing queries that ask for a few top-ranked objects can be viewed, in many cases, as that of evaluating selection conditions. Thus, both problems can be viewed together as an extended filtering problem.

BIRCH: an efficient data clustering method for very large databases
Tian Zhang, Raghu Ramakrishnan, Miron Livny
Pages: 103-114
DOI: 10.1145/233269.233324

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs. This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively. We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparison of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.
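
A minimal sketch of the incremental summarization idea behind BIRCH: a subcluster is represented by a clustering feature (count, linear sum, sum of squares), which can absorb points one at a time and still yield centroids and radii. The CF-tree, thresholds, and rebuilding logic are omitted; this is an illustration, not the authors' code.

```python
# Sketch of BIRCH's clustering feature (CF): a subcluster is summarized by
# (N, LS, SS) = (count, linear sum, sum of squares), which is additive and is
# enough to compute centroids and radii incrementally.
import math

class CF:
    def __init__(self, dim):
        self.n, self.ls, self.ss = 0, [0.0] * dim, 0.0

    def add(self, point):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, point)]
        self.ss += sum(x * x for x in point)

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        c = self.centroid()                      # sqrt of mean squared distance to centroid
        return math.sqrt(max(0.0, self.ss / self.n - sum(x * x for x in c)))

cf = CF(dim=2)
for p in [(1.0, 2.0), (2.0, 2.0), (1.5, 1.0)]:
    cf.add(p)
print(cf.centroid(), cf.radius())
```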

On-line reorganization of sparsely-populated B+-trees
Chendong Zou, Betty Salzberg
Pages: 115-124
DOI: 10.1145/233269.233325

In this paper, we present an efficient method for online reorganization of sparsely-populated B+-trees. It reorganizes the leaves first, compacting groups of leaves with the same parent in short operations. After compacting, optionally, the new leaves may swap locations or be moved into empty pages so that they are in key order on the disk. After the leaves are reorganized, the method shrinks the tree by making a copy of the upper part of the tree while leaving the leaves in place. A new concurrency method is introduced so that only a minimum number of pages are locked during reorganization. During leaf reorganization, Forward Recovery is used to save all work already done while maintaining consistency after system crashes. A heuristic algorithm is developed to reduce the number of swaps needed during leaf reorganization, so that better concurrency and easier recovery can be achieved. The switch from the old B+-tree to the new B+-tree is described in detail for the first time.

Two techniques for on-line index modification in shared nothing parallel databases
Kiran J. Achyutuni, Edward Omiecinski, Shamkant B. Navathe
Pages: 125-136
DOI: 10.1145/233269.233326

Whenever data is moved across nodes in a parallel database system, the indexes need to be modified too. Index modification overhead can be quite severe because there can be a large number of indexes on a relation. In this paper, we study two alternative approaches to index modification, namely OAT (One-At-a-Time page movement) and BULK (bulk page movement). OAT and BULK are two extremes on the spectrum of the granularity of data movement. OAT and BULK differ in two respects: first, OAT uses very little additional disk space (at most one extra page), whereas BULK uses a large amount of disk space. Second, BULK uses sequential prefetch I/O to optimize on the number of I/Os during index modification, while OAT does not. Using an experimental testbed, we show that BULK is an order of magnitude faster than OAT. In terms of the impact on transaction performance during reorganization, BULK and OAT perform differently: when the number of indexes to be modified is either one or two, OAT causes less transaction performance degradation. However, when the number of indexes is greater than two, both techniques have the same impact on transaction performance.

Query caching and optimization in distributed mediator systems
S. Adali, K. S. Candan, Y. Papakonstantinou, V. S. Subrahmanian
Pages: 137-146
DOI: 10.1145/233269.233327

Query processing and optimization in mediator systems that access distributed non-proprietary sources pose many novel problems. Cost-based query optimization is hard because the mediator does not have access to source statistics information and furthermore it may not be easy to model the source's performance. At the same time, querying remote sources may be very expensive because of high connection overhead, long computation time, financial charges, and temporary unavailability. We propose a cost-based optimization technique that caches statistics of actual calls to the sources and consequently estimates the cost of the possible execution plans based on the statistics cache. We investigate issues pertaining to the design of the statistics cache and experimentally analyze various tradeoffs. We also present a query result caching mechanism that allows us to effectively use results of prior queries when the source is not readily available. We employ the novel invariants mechanism, which shows how semantic information about data sources may be used to discover cached query results of interest.

Performance tradeoffs for client-server query processing
Michael J. Franklin, Björn Thór Jónsson, Donald Kossmann
Pages: 149-160
DOI: 10.1145/233269.233328

The construction of high-performance database systems that combine the best aspects of the relational and object-oriented approaches requires the design of client-server architectures that can fully exploit client and server resources in a flexible manner. The two predominant paradigms for client-server query execution are data-shipping and query-shipping. We first define these policies in terms of the restrictions they place on operator site selection during query optimization. We then investigate the performance tradeoffs between them for bulk query processing. While each strategy has advantages, neither one on its own is efficient across a wide range of circumstances. We describe and evaluate a more flexible policy called hybrid-shipping, which can execute queries at clients, servers, or any combination of the two. Hybrid-shipping is shown to at least match the best of the two "pure" policies, and in some situations, to perform better than both. The implementation of hybrid-shipping raises a number of difficult problems for query optimization. We describe an initial investigation into the use of a 2-step query optimization strategy as a way of addressing these issues.

Data access for the masses through OLE DB
José A. Blakeley
Pages: 161-172
DOI: 10.1145/233269.233329

This paper presents an overview of OLE DB, a set of interfaces being developed at Microsoft whose goal is to enable applications to have uniform access to data stored in DBMS and non-DBMS information containers. Applications will be able to take advantage of the benefits of database technology without having to transfer data from its place of origin to a DBMS. Our approach consists of defining an open, extensible collection of interfaces that factor and encapsulate orthogonal, reusable portions of DBMS functionality. These interfaces define the boundaries of DBMS components such as record containers, query processors, and transaction coordinators that enable uniform, transactional access to data among such components. The proposed interfaces extend Microsoft's OLE/COM object services framework with database functionality, hence these interfaces are collectively referred to as OLE DB. The OLE DB functional areas include data access and updates (rowsets), query processing, schema information, notifications, transactions, security, and access to remote data. In a sense, OLE DB represents an effort to bring database technology to the masses. This paper presents an overview of the OLE DB approach and its areas of componentization.

The dangers of replication and a solution
Jim Gray, Pat Helland, Patrick O'Neil, Dennis Shasha
Pages: 173-182
DOI: 10.1145/233269.233330

Update anywhere-anytime-anyway transactional replication has unstable behavior as the workload scales up: a ten-fold increase in nodes and traffic gives a thousand-fold increase in deadlocks or reconciliations. Master copy replication (primary copy) schemes reduce this problem. A simple analytic model demonstrates these results. A new two-tier replication algorithm is proposed that allows mobile (disconnected) applications to propose tentative update transactions that are later applied to a master copy. Commutative update transactions avoid the instability of other replication schemes.
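
Restating the quoted scale-up numbers as a formula (a paraphrase of the abstract's claim, not the paper's full analytic model):

```latex
% If nodes and traffic each grow by a factor k, the abstract's numbers
% (k = 10 giving a 1000-fold increase) correspond to cubic growth:
\text{deadlocks or reconciliations} \;\propto\; k^{3},
\qquad k = 10 \;\Rightarrow\; k^{3} = 1000.
```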

Hot mirroring: a method of hiding parity update penalty and degradation during rebuilds for RAID5
Kazuhiko Mogi, Masaru Kitsuregawa
Pages: 183-194
DOI: 10.1145/233269.233331

This paper proposes a storage management scheme for disk arrays, named hot mirroring. In this scheme, storage space is partitioned into two regions. One is the mirrored region, which is characterized by high performance and low storage efficiency. The other is the RAID5 region, which is characterized by low performance and high storage efficiency. Hot data blocks are stored in the former area, while cold blocks are stored in the latter. In addition, mirrored pairs and RAID5 stripes are orthogonally laid out, through which the performance degradation during rebuilding is minimized. Hot block clustering in hot mirroring achieves higher performance than conventional RAID5 arrays. The potential of hot mirroring is examined through extensive simulation.

Random I/O scheduling in online tertiary storage systems
Bruce K. Hillyer, Avi Silberschatz
Pages: 195-204
DOI: 10.1145/233269.233332

New database applications that require the storage and retrieval of many terabytes of data are reaching the limits for disk-based storage systems, in terms of both cost and scalability. These limits provide a strong incentive for the development of databases that augment disk storage with technologies better suited to large volumes of data. In particular, the seamless incorporation of tape storage into database systems would be of great value. Tape storage is two orders of magnitude more efficient than disk in terms of cost per terabyte and physical volume per terabyte; however, a key problem is that the random access latency of tape is three to four orders of magnitude slower than disk. Thus, to incorporate a tape bulk store in an online storage system, the problem of tape access latency must be solved. One approach to reducing the latency is careful I/O scheduling. The focus of this paper is on efficient random I/O scheduling for tape drives that use a serpentine track layout, such as the Quantum DLT and the IBM 3480 and 3590. For serpentine tape, I/O scheduling is problematic because of the complex relationships between logical block numbers, their physical positions on tape, and the time required for tape positioning between these physical positions. The results in this paper show that our scheduling schemes provide a significant improvement in the latency of random access to serpentine tape.

Implementing data cubes efficiently
Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman
Pages: 205-216
DOI: 10.1145/233269.233333

Decision support applications involve complex queries on very large databases. Since response times should be small, query optimization is critical. Users typically view the data as multidimensional data cubes. Each cell of the data cube is a view consisting of an aggregation of interest, like total sales. The values of many of these cells are dependent on the values of other cells in the data cube. A common and powerful query optimization technique is to materialize some or all of these cells rather than compute them from raw data each time. Commercial systems differ mainly in their approach to materializing the data cube. In this paper, we investigate the issue of which cells (views) to materialize when it is too expensive to materialize all views. A lattice framework is used to express dependencies among views. We present greedy algorithms that work off this lattice and determine a good set of views to materialize. The greedy algorithm performs within a small constant factor of optimal under a variety of models. We then consider the most common case of the hypercube lattice and examine the choice of materialized views for hypercubes in detail, giving some good tradeoffs between the space used and the average time to answer a query.
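
The greedy selection over a view lattice can be sketched compactly: repeatedly materialize the view whose benefit, the total cost reduction over the views it can answer, is largest. The tiny two-dimension lattice and the sizes below are invented for illustration and are not from the paper.

```python
# Sketch of greedy view selection over a lattice: repeatedly materialize the
# view with the largest "benefit", i.e. the total drop in cost (view sizes
# stand in for query costs) over all views that could be answered from it.

def cheapest_source(view, materialized, size, ancestors):
    return min(size[a] for a in ancestors[view] | {view} if a in materialized)

def greedy_select(views, size, ancestors, k, top):
    chosen = {top}                              # the raw-data (top) view is always available
    for _ in range(k):
        def benefit(v):
            return sum(max(0, cheapest_source(w, chosen, size, ancestors) - size[v])
                       for w in views if v in ancestors[w] | {w})
        best = max((v for v in views if v not in chosen), key=benefit)
        chosen.add(best)
    return chosen

size = {"ab": 100, "a": 20, "b": 50, "()": 1}
ancestors = {"ab": set(), "a": {"ab"}, "b": {"ab"}, "()": {"a", "b", "ab"}}
print(greedy_select(size, size, ancestors, k=2, top="ab"))   # e.g. {'ab', 'a', 'b'}
```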

Providing better support for a class of decision support queries
Sudhir G. Rao, Antonio Badia, Dirk van Gucht
Pages: 217-227
DOI: 10.1145/233269.233334

Relational database systems do not effectively support complex queries containing quantifiers (quantified queries) that are increasingly becoming important in decision support applications. Generalized quantifiers provide an effective way of expressing such queries naturally. In this paper, we consider the problem of processing quantified queries within the generalized quantifier framework. We demonstrate that current relational systems are ill-equipped, both at the language and at the query processing level, to deal with such queries. We also provide insights into the intrinsic difficulties associated with processing such queries. We then describe the implementation of a quantified query processor, Q2P, that is based on multidimensional and boolean matrix structures. We provide results of performance experiments run on Q2P that demonstrate superior performance on quantified queries. Our results indicate that it is feasible to augment relational systems with query subsystems like Q2P for significant performance benefits for quantified queries in decision support applications.

A query language for multidimensional arrays: design, implementation, and optimization techniques
Leonid Libkin, Rona Machlin, Limsoon Wong
Pages: 228-239
DOI: 10.1145/233269.233335

While much recent research has focussed on extending databases beyond the traditional relational model, relatively little has been done to develop database tools for querying data organized in (multidimensional) arrays. The scientific computing community has made little use of available database technology. Instead, multidimensional scientific data is typically stored in local files conforming to various data exchange formats and queried via specialized access libraries tied in to general purpose programming languages. To allow such data to be queried using known database techniques, we design and implement a query language for multidimensional arrays. Our main design decision is to treat arrays as functions from index sets to values rather than as collection types. This leads to clean syntax and semantics as well as simple but powerful optimization rules. We present a calculus for arrays that extends standard calculi for complex objects. We derive a higher-level comprehension style query language based on this calculus and describe its implementation, including a data driver for the NetCDF data exchange format. Next, we explore some optimization rules obtained from the equational laws of our core calculus. Finally, we study the expressiveness of our calculus and prove that it essentially corresponds to adding ranking to a query language for complex objects.

A super scalar sort algorithm for RISC processors
Ramesh C. Agarwal
Pages: 240-246
DOI: 10.1145/233269.233336

The compare and branch sequences required in a traditional sort algorithm cannot efficiently exploit multiple execution units present in currently available high performance RISC processors. This is because of the long latency of the compare instructions and the sequential algorithm used in sorting. With the increased level of integration on a chip, this trend is expected to continue. We have developed new sort algorithms which eliminate almost all the compares, provide functional parallelism which can be exploited by multiple execution units, significantly reduce the number of passes through keys, and improve data locality. These new algorithms outperform traditional sort algorithms by a large factor. For the Datamation disk-to-disk sort benchmark (one million 100-byte records), at SIGMOD '94, Chris Nyberg et al. presented several new performance records using DEC Alpha processor based systems. We have implemented the Datamation sort benchmark using our new sort algorithm on a desktop IBM RS/6000 model 39H (66.6 MHz) with 8 IBM SSA 7133 disk drives (total cost $73K). The total elapsed time for the 100 MB sort was 5.1 seconds (vs. the old uni-processor record of 9.1 seconds). We have also established a new price performance record (0.2¢ vs. the old record of 0.9¢ as the cost of the sort). The entire sort processing was overlapped with I/O. During the read phase, we achieved a sustained BW of 47 MB/sec and during the write phase, we achieved a sustained BW of 39 MB/sec. Key extraction and sorting of one million 10-byte keys took only 0.6 second of CPU time. The rest of the CPU time was used in moving records, servicing I/O, and other overheads. Algorithmic details leading to this level of performance are described in this paper. A detailed analysis of the CPU time spent during various phases of the sort algorithm and I/O is also provided.
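
The abstract does not spell out the new algorithms, but one standard way to remove compare-and-branch sequences from a sort inner loop is a byte-wise counting (radix) pass, sketched below purely as an illustration of compare-free sorting; it is not claimed to be the paper's method.

```python
# Illustrative only: a least-significant-byte-first counting sort removes the
# compare-and-branch from the inner loop; the paper's actual algorithm and
# record handling are not reproduced here.

def radix_sort_u32(keys):
    for shift in (0, 8, 16, 24):             # one counting pass per key byte
        counts = [0] * 257
        for k in keys:
            counts[((k >> shift) & 0xFF) + 1] += 1
        for i in range(256):                 # prefix sums give output positions
            counts[i + 1] += counts[i]
        out = [0] * len(keys)
        for k in keys:                       # stable scatter, no key compares
            b = (k >> shift) & 0xFF
            out[counts[b]] = k
            counts[b] += 1
        keys = out
    return keys

print(radix_sort_u32([305419896, 42, 7, 4294967295, 1000000]))
```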

Spatial hash-joins
Ming-Ling Lo, Chinya V. Ravishankar
Pages: 247-258
DOI: 10.1145/233269.233337

We examine how to apply the hash-join paradigm to spatial joins, and define a new framework for spatial hash-joins. Our spatial partition functions have two components: a set of bucket extents and an assignment function, which may map a data item into multiple buckets. Furthermore, the partition functions for the two input datasets may be different. We have designed and tested a spatial hash-join method based on this framework. The partition function for the inner dataset is initialized by sampling the dataset, and evolves as data are inserted. The partition function for the outer dataset is immutable, but may replicate a data item from the outer dataset into multiple buckets. The method mirrors relational hash-joins in other aspects. Our method needs no pre-computed indices. It is therefore applicable to a wide range of spatial joins. Our experiments show that our method outperforms current spatial join algorithms based on tree matching by a wide margin. Further, its performance is superior even when the tree-based methods have pre-computed indices. This makes the spatial hash-join method highly competitive both when the input datasets are dynamically generated and when the datasets have pre-computed indices.
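
A toy rendering of the framework above: the inner dataset's partition function assigns each rectangle to one grid bucket (by centroid), the outer's replicates a rectangle into every bucket it overlaps, and candidate pairs are checked bucket by bucket. The grid buckets, rectangle format, and linear-scan matching are assumptions for illustration only.

```python
# Sketch: inner rectangles go to one bucket each; outer rectangles are
# replicated into every overlapping bucket; overlapping pairs are found
# bucket by bucket. Rectangles are (xmin, ymin, xmax, ymax).

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def grid_buckets(n, extent=100.0):
    step = extent / n
    return [(i * step, j * step, (i + 1) * step, (j + 1) * step)
            for i in range(n) for j in range(n)]

def spatial_hash_join(inner, outer, buckets):
    inner_parts = {i: [] for i in range(len(buckets))}
    for r in inner:                          # inner: one bucket per rectangle (by centroid)
        cx, cy = (r[0] + r[2]) / 2, (r[1] + r[3]) / 2
        target = next(i for i, b in enumerate(buckets) if overlaps(b, (cx, cy, cx, cy)))
        inner_parts[target].append(r)
    results = set()
    for s in outer:                          # outer: replicate into all overlapping buckets
        for i, b in enumerate(buckets):
            if overlaps(b, s):
                results.update((r, s) for r in inner_parts[i] if overlaps(r, s))
    return results

inner = [(10, 10, 20, 20), (60, 60, 70, 70)]
outer = [(15, 15, 65, 65), (80, 5, 90, 15)]
print(spatial_hash_join(inner, outer, grid_buckets(2)))
```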

Partition based spatial-merge join
Jignesh M. Patel, David J. DeWitt
Pages: 259-270
DOI: 10.1145/233269.233338

This paper describes PBSM (Partition Based Spatial-Merge), a new algorithm for performing the spatial join operation. This algorithm is especially effective when neither of the inputs to the join has an index on the joining attribute. Such a situation could arise if both inputs to the join are intermediate results in a complex query, or in a parallel environment where the inputs must be dynamically redistributed. The PBSM algorithm partitions the inputs into manageable chunks, and joins them using a computational geometry based plane-sweeping technique. This paper also presents a performance study comparing the traditional indexed nested loops join algorithm, a spatial join algorithm based on joining spatial indices, and the PBSM algorithm. These comparisons are based on complete implementations of these algorithms in Paradise, a database system for handling GIS applications. Using real data sets, the performance study examines the behavior of these spatial join algorithms in a variety of situations, including the cases when both, one, or none of the inputs to the join have a suitable index. The study also examines the effect of clustering the join inputs on the performance of these join algorithms. The performance comparisons demonstrate the feasibility and applicability of the PBSM join algorithm.

Bifocal sampling for skew-resistant join size estimation
Sumit Ganguly, Phillip B. Gibbons, Yossi Matias, Avi Silberschatz
Pages: 271-281
DOI: 10.1145/233269.233340

This paper introduces bifocal sampling, a new technique for estimating the size of an equi-join of two relations. Bifocal sampling classifies tuples in each relation into two groups, sparse and dense, based on the number of tuples with the same join value. Distinct estimation procedures are employed that focus on various combinations for joining tuples (e.g., for estimating the number of joining tuples that are dense in both relations). This combination of estimation procedures overcomes some well-known problems in previous schemes, enabling good estimates with no a priori knowledge about the data distribution. The estimate obtained by the bifocal sampling algorithm is proven to lie with high probability within a small constant factor of the actual join size, regardless of the skew, as long as the join size is Ω(n lg n), for relations consisting of n tuples. The algorithm requires a sample of size at most O(√n lg n). By contrast, previous algorithms using a sample of similar size may require the join size to be Ω(n√n) to guarantee an accurate estimate. Experimental results support the theoretical claims and show that bifocal sampling is practical and effective.
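
Only the first step, classifying join values as sparse or dense from a uniform sample, is sketched below; the estimators that combine the dense/dense and sparse cases are not reproduced, and the sample size and threshold are arbitrary assumptions.

```python
# Toy illustration of the classification step of bifocal sampling: join values
# seen often enough in a uniform sample are treated as "dense", the rest as
# "sparse". The paper's actual estimators are not reproduced here.
import random
from collections import Counter

def classify(relation, sample_size, threshold=2, seed=0):
    random.seed(seed)
    sample = random.sample(relation, min(sample_size, len(relation)))
    freq = Counter(sample)
    dense = {v for v, c in freq.items() if c >= threshold}
    sparse = set(freq) - dense
    return dense, sparse

r = [1] * 500 + list(range(2, 300))    # join value 1 is heavily skewed
s = [1] * 400 + list(range(2, 500))
print(classify(r, 100))
print(classify(s, 100))
```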

Estimating alphanumeric selectivity in the presence of wildcards
P. Krishnan, Jeffrey Scott Vitter, Bala Iyer
Pages: 282-293
DOI: 10.1145/233269.233341

Success of commercial query optimizers and database management systems (object-oriented or relational) depends on accurate cost estimation of various query reorderings [BGI]. Estimating predicate selectivity, or the fraction of rows in a database that satisfy a selection predicate, is key to determining the optimal join order. Previous work has concentrated on estimating selectivity for numeric fields [ASW, HaSa, IoP, LNS, SAC, WVT]. With the popularity of textual data being stored in databases, it has become important to estimate selectivity accurately for alphanumeric fields. A particularly problematic predicate used against alphanumeric fields is the SQL like predicate [Dat]. Techniques used for estimating numeric selectivity are not suited for estimating alphanumeric selectivity. In this paper, we study for the first time the problem of estimating alphanumeric selectivity in the presence of wildcards. Based on the intuition that the model built by a data compressor on an input text encapsulates information about common substrings in the text, we develop a technique based on the suffix tree data structure to estimate alphanumeric selectivity. In a statistics generation pass over the database, we construct a compact suffix tree-based structure from the columns of the database. We then look at three families of methods that utilize this structure to estimate selectivity during query plan costing, when a query with predicates on alphanumeric attributes contains wildcards in the predicate. We evaluate our methods empirically in the context of the TPC-D benchmark. We study our methods experimentally against a variety of query patterns and identify five techniques that hold promise.
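
A rough sketch of the intuition: a statistics pass counts short substrings of the column values (a q-gram hash table stands in for the paper's pruned suffix tree), and at costing time the frequency of the pattern's rarest substring bounds the selectivity of a LIKE '%pattern%' predicate. Prefix/suffix patterns and the real suffix-tree estimators are omitted.

```python
# A q-gram frequency table stands in for the paper's pruned suffix tree: count,
# per substring, how many rows contain it; then the rarest contained substring
# of the pattern bounds the selectivity of LIKE '%pattern%'. q = 3 is arbitrary.
from collections import Counter

def build_stats(column, q=3):
    rows_with = Counter()
    for value in column:
        for gram in {value[i:i + q] for i in range(len(value) - q + 1)}:
            rows_with[gram] += 1            # rows containing the substring at least once
    return rows_with, len(column)

def like_infix_selectivity(stats, pattern, q=3):
    rows_with, n = stats
    if len(pattern) < q:
        return 1.0                          # pattern too short for the table; be conservative
    grams = [pattern[i:i + q] for i in range(len(pattern) - q + 1)]
    return min(rows_with.get(g, 0) for g in grams) / n

col = ["database", "data mining", "mediator", "metadata", "index"]
print(like_infix_selectivity(build_stats(col), "data"))   # estimate for LIKE '%data%': 0.6
```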

Improved histograms for selectivity estimation of range predicates
Viswanath Poosala, Peter J. Haas, Yannis E. Ioannidis, Eugene J. Shekita
Pages: 294-305
DOI: 10.1145/233269.233342

Many commercial database systems maintain histograms to summarize the contents of relations and permit efficient estimation of query result sizes and access plan costs. Although several types of histograms have been proposed in the past, there has never been a systematic study of all histogram aspects, the available choices for each aspect, and the impact of such choices on histogram effectiveness. In this paper, we provide a taxonomy of histograms that captures all previously proposed histogram types and indicates many new possibilities. We introduce novel choices for several of the taxonomy dimensions, and derive new histogram types by combining choices in effective ways. We also show how sampling techniques can be used to reduce the cost of histogram construction. Finally, we present results from an empirical study of the proposed histogram types used in selectivity estimation of range predicates and identify the histogram types that have the best overall performance.
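
As a small worked example of one point in the taxonomy, the sketch below builds an equi-depth histogram and estimates a range predicate's selectivity under the usual uniform-spread-within-bucket assumption; the bucket count and data are arbitrary and not taken from the paper.

```python
# An equi-depth histogram used to estimate how many rows satisfy a range
# predicate, assuming values are spread uniformly inside each bucket.

def equi_depth_histogram(values, buckets):
    values = sorted(values)
    per = len(values) / buckets
    bounds = [values[0]] + [values[min(len(values) - 1, int(round((i + 1) * per)) - 1)]
                            for i in range(buckets)]
    return bounds, per                       # bucket i spans (bounds[i], bounds[i+1]]

def estimate_range(hist, lo, hi):
    bounds, per = hist
    rows = 0.0
    for b_lo, b_hi in zip(bounds, bounds[1:]):
        if b_hi <= lo or b_lo >= hi or b_hi == b_lo:
            continue
        overlap = min(hi, b_hi) - max(lo, b_lo)
        rows += per * overlap / (b_hi - b_lo)
    return rows

data = list(range(1, 101))                   # 1..100, uniform for simplicity
hist = equi_depth_histogram(data, buckets=4)
print(estimate_range(hist, 20, 40))          # exact answer is 21; the estimate is close
```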

Structures for manipulating proposed updates in object-oriented databases
Michael Doherty, Richard Hull, Mohammed Rupawalla
Pages: 306-317
DOI: 10.1145/233269.233344

Support for virtual states and deltas between them is useful for a variety of database applications, including hypothetical database access, version management, simulation, and active databases. The Heraclitus paradigm elevates delta values to be "first-class citizens" in database programming languages, so that they can be explicitly created, accessed and manipulated. A fundamental issue concerns the trade-off between the "accuracy" or "robustness" of a form of delta representation, and the ease of access and manipulation of that form. At one end of the spectrum, code-blocks could be used to represent delta values, resulting in a more accurate capture of the intended meaning of a proposed update, at the cost of more expensive access and manipulation. In the context of object-oriented databases, another point on the spectrum is "attribute-granularity" deltas which store the net changes to each modified attribute value of modified objects. This paper introduces a comprehensive framework for specifying a broad array of forms for representing deltas for complex value types (tuple, set, bag, list, o-set and dictionary). In general, the granularity of such deltas can be arbitrarily deep within the complex value structure. Applications of this framework in connection with hypothetical access to, and "merging" of, proposed updates are discussed.

Safe and efficient sharing of persistent objects in Thor
B. Liskov, A. Adya, M. Castro, S. Ghemawat, R. Gruber, U. Maheshwari, A. C. Myers, M. Day, L. Shrira
Pages: 318-329
DOI: 10.1145/233269.233346

Thor is an object-oriented database system designed for use in a heterogeneous distributed environment. It provides highly-reliable and highly-available persistent storage for objects, and supports safe sharing of these objects by applications written in different programming languages. Safe heterogeneous sharing of long-lived objects requires encapsulation: the system must guarantee that applications interact with objects only by invoking methods. Although safety concerns are important, most object-oriented databases forgo safety to avoid paying the associated performance costs. This paper gives an overview of Thor's design and implementation. We focus on two areas that set Thor apart from other object-oriented databases. First, we discuss safe sharing and techniques for ensuring it; we also discuss ways of improving application performance without sacrificing safety. Second, we describe our approach to cache management at client machines, including a novel adaptive prefetching strategy. The paper presents performance results for Thor, on several OO7 benchmark traversals. The results show that adaptive prefetching is very effective, improving both the elapsed time of traversals and the amount of space used in the client cache. The results also show that the cost of safe sharing can be negligible; thus it is possible to have both safety and high performance.

An open abstract-object storage system
Stephen Blott, Lukas Relly, Hans-Jörg Schek
Pages: 330-340
DOI: 10.1145/233269.233348

Database systems must become more open to retain their relevance as a technology of choice and necessity. Openness implies not only databases exporting their data, but also exporting their services. This is as true in classical application areas as in non-classical ones (GIS, multimedia, design, etc.). This paper addresses the problem of exporting the storage-management services of indexing, replication and basic query processing. We describe an abstract-object storage model which provides the basic mechanism, 'likeness', through which these services are applied uniformly to internally-stored, internally-defined data, and to externally-stored, externally-defined data. Managing external data requires the coupling of external operations to the database system. We discuss the interfaces and protocols required of these to achieve correct resource management and admit efficient realisation. Throughout, we demonstrate our solutions in the area of semi-structured file management; in our case, geospatial metadata files.

Static detection of security flaws in object-oriented databases
Keishi Tajima
Pages: 341-352
DOI: 10.1145/233269.233349

Access control in function granularity is one of the features of many object-oriented databases. In those systems, the users are granted rights to invoke composed functions instead of rights to invoke primitive operations. Although primitive operations are invoked inside composed functions, the users can invoke them only through the granted functions. This achieves access control in abstract operation level. Access control utilizing encapsulated functions, however, easily causes many "security flaws" through which malicious users can bypass the encapsulation and can abuse the primitive operations inside the functions. In this paper, we develop a technique to statically detect such security flaws. First, we design a framework to describe security requirements that should be satisfied. Then, we develop an algorithm that syntactically analyzes program code of the functions and determines whether given security requirements are satisfied or not. This algorithm is sound, that is, whenever there is a security flaw, it detects it.

Goal-oriented buffer management revisited
Kurt P. Brown, Michael J. Carey, Miron Livny
Pages: 353-364
DOI: 10.1145/233269.233351

In this paper we revisit the problem of achieving multi-class workload response time goals by automatically adjusting the buffer memory allocations of each workload class. We discuss the virtues and limitations of previous work with respect to a set of criteria we lay out for judging the success of any goal-oriented resource allocation algorithm. We then introduce the concept of hit rate concavity and develop a new goal-oriented buffer allocation algorithm, called Class Fencing, that is based on this concept. Exploiting the notion of hit rate concavity results in an algorithm that is not only as accurate and stable as our previous work, but also more responsive, more robust, and simpler to implement.

Multi-dimensional resource scheduling for parallel queries
Minos N. Garofalakis, Yannis E. Ioannidis
Pages: 365-376
DOI: 10.1145/233269.233352

Scheduling query execution plans is an important component of query optimization in parallel database systems. The problem is particularly complex in a shared-nothing execution environment, where each system node represents a collection of time-shareable resources (e.g., CPU(s), disk(s), etc.) and communicates with other nodes only by message-passing. Significant research effort has concentrated on only a subset of the various forms of intra-query parallelism so that scheduling and synchronization is simplified. In addition, most previous work has focused its attention on one-dimensional models of parallel query scheduling, effectively ignoring the potential benefits of resource sharing. In this paper, we develop an approach that is more general in both directions, capturing all forms of intra-query parallelism and exploiting sharing of multi-dimensional resource nodes among concurrent plan operators. This allows scheduling a set of independent query tasks (i.e., operator pipelines) to be seen as an instance of the multi-dimensional bin-design problem. Using a novel quantification of coarse grain parallelism, we present a list scheduling heuristic algorithm that is provably near-optimal in the class of coarse grain parallel executions (with a worst-case performance ratio that depends on the number of resources per node and the granularity parameter). We then extend this algorithm to handle the operator precedence constraints in a bushy query plan by splitting the execution of the plan into synchronized phases. Preliminary performance results confirm the effectiveness of our scheduling algorithm compared both to previous approaches and the optimal solution. Finally, we present a technique that allows us to relax the coarse granularity restriction and obtain a list scheduling method that is provably near-optimal in the space of all possible parallel schedules.
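
A heavily simplified sketch of the multi-dimensional view: each independent pipeline is a vector of per-resource work, and a list scheduler places the largest remaining pipeline on the node whose bottleneck resource stays smallest. The paper's granularity conditions and precedence handling are omitted, and all numbers are invented.

```python
# Toy list scheduling of operator pipelines over multi-dimensional nodes:
# a pipeline is a (CPU, disk) work vector; the largest pipeline is placed next
# on the node where the resulting bottleneck load is smallest.

def list_schedule(pipelines, num_nodes, dims=2):
    loads = [[0.0] * dims for _ in range(num_nodes)]
    assignment = []
    for pid, demand in sorted(enumerate(pipelines),
                              key=lambda p: -max(p[1])):      # largest first
        best = min(range(num_nodes),
                   key=lambda n: max(l + d for l, d in zip(loads[n], demand)))
        loads[best] = [l + d for l, d in zip(loads[best], demand)]
        assignment.append((pid, best))
    return assignment, loads

pipes = [(4.0, 1.0), (1.0, 3.0), (2.0, 2.0), (3.0, 0.5)]      # (CPU, disk) work
print(list_schedule(pipes, num_nodes=2))
```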

Semi-automatic, self-adaptive control of garbage collection rates in object databases
Jonathan E. Cook, Artur W. Klauser, Alexander L. Wolf, Benjamin G. Zorn
Pages: 377-388
DOI: 10.1145/233269.233354

A fundamental problem in automating object database storage reclamation is determining how often to perform garbage collection. We show that the choice of collection rate can have a significant impact on application performance and that the "best" rate depends on the dynamic behavior of the application, tempered by the particular performance goals of the user. We describe two semi-automatic, self-adaptive policies for controlling collection rate that we have developed to address the problem. Using trace-driven simulations, we evaluate the performance of the policies on a test database application that demonstrates two distinct reclustering behaviors. Our results show that the policies are effective at achieving user-specified levels of I/O operations and database garbage percentage. We also investigate the sensitivity of the policies over a range of object connectivities. The evaluation demonstrates that semi-automatic, self-adaptive policies are a practical means for flexibly controlling garbage collection rate.

Towards effective and efficient free space management
Mark L. McAuliffe, Michael J. Carey, Marvin H. Solomon
Pages: 389-400
DOI: 10.1145/233269.233355

An important problem faced by many database management systems is the "online object placement problem": the problem of choosing a disk page to hold a newly allocated object. In the absence of clustering criteria, the goal is to maximize storage utilization. For main-memory based systems, simple heuristics exist that provide reasonable space utilization in the worst case and excellent utilization in typical cases. However, the storage management problem for databases includes significant additional challenges, such as minimizing I/O traffic, coping with crash recovery, and gracefully integrating space management with locking and logging. We survey several object placement algorithms, including techniques that can be found in commercial and research database systems. We then present a new object placement algorithm that we have designed for use in Shore, an object-oriented database system under development at the University of Wisconsin-Madison. Finally, we present results from a series of experiments involving actual Shore implementations of some of these algorithms. Our results show that while current object placement algorithms have serious performance deficiencies, including excessive CPU or main memory overhead, I/O traffic, or poor disk utilization, our new algorithm consistently provides excellent performance in all of these areas.

Rule languages and internal algebras for rule-based optimizers
Mitch Cherniack, Stanley B. Zdonik
Pages: 401-412
DOI: 10.1145/233269.233356

Rule-based optimizers and optimizer generators use rules to specify query transformations. Rules act directly on query representations, which typically are based on query algebras. But most algebras complicate rule formulation, and rules over these algebras must often resort to calling externally defined bodies of code. Code makes rules difficult to formulate, prove correct and reason about, and therefore compromises the effectiveness of rule-based systems. In this paper we present KOLA: a combinator-based algebra designed to simplify rule formulation. KOLA is not a user language, and KOLA's variable-free queries are difficult for humans to read. But KOLA is an effective internal algebra because its combinator style makes queries manipulable and structurally revealing. As a result, rules over KOLA queries are easily expressed without the need for supplemental code. We illustrate this point, first by showing some transformations that, despite their simplicity, require head and body routines when expressed over algebras that include variables. We show that these transformations are expressible without supplemental routines in KOLA. We then show complex transformations of a class of nested queries expressed over KOLA. Nested query optimization, while having been studied before, has seriously challenged the rule-based paradigm.
|
|
|
Evaluating queries with generalized path expressions |
| |
Vassilis Christophides,
Sophie Cluet,
Guido Moerkotte
|
|
Pages: 413-422 |
|
doi>10.1145/233269.233358 |
|
Full text: PDF
|
|
In the past few years, query languages featuring generalized path expressions have been proposed. These languages allow the interrogation of both data and structure. They are powerful and essential for a number of applications. However, until now, their evaluation has relied on a rather naive and inefficient algorithm. In this paper, we extend an object algebra with two new operators and present some interesting rewriting techniques for queries featuring generalized path expressions. We also show how a query optimizer can integrate the new techniques.
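For readers unfamiliar with generalized path expressions, the toy sketch below shows the kind of query they enable: a pattern with a wildcard component is matched against an object graph, returning both the matched path (structure) and the reached value (data). The pattern syntax and the naive recursive evaluation are invented for illustration; such naive strategies are what algebraic rewriting of these queries tries to avoid.

```python
# Toy sketch of a generalized path expression: pattern components are
# attribute labels or a '*' wildcard matching any (possibly empty) sequence
# of labels. Evaluation returns (path, value) pairs.

def eval_path(obj, pattern, path=()):
    if not pattern:
        yield path, obj
        return
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        yield from eval_path(obj, rest, path)            # '*' matches zero labels
        if isinstance(obj, dict):
            for label, child in obj.items():             # ... or one more label
                yield from eval_path(child, pattern, path + (label,))
    elif isinstance(obj, dict) and head in obj:
        yield from eval_path(obj[head], rest, path + (head,))

db = {"restaurant": {"address": {"city": "Paris", "zipcode": "75005"},
                     "owner": {"zipcode": "75013"}}}

if __name__ == "__main__":
    for path, value in eval_path(db, ("restaurant", "*", "zipcode")):
        print(path, value)
    # ('restaurant', 'address', 'zipcode') 75005
    # ('restaurant', 'owner', 'zipcode') 75013
```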
|
|
|
Query execution techniques for caching expensive methods |
| |
Joseph M. Hellerstein,
Jeffrey F. Naughton
|
|
Pages: 423-434 |
|
doi>10.1145/233269.233359 |
|
Full text: PDF
|
|
Object-Relational and Object-Oriented DBMSs allow users to invoke time-consuming ("expensive") methods in their queries. When queries containing these expensive methods are run on data with duplicate values, time is wasted redundantly computing methods on the same value. This problem has been studied in the context of programming languages, where "memoization" is the standard solution. In the database literature, sorting has been proposed to deal with this problem. We compare these approaches along with a third solution, a variant of unary hybrid hashing which we call Hybrid Cache. We demonstrate that Hybrid Cache always dominates memoization, and significantly outperforms sorting in many instances. This provides new insights into the tradeoff between hashing and sorting for unary operations. Additionally, our Hybrid Cache algorithm includes some new optimizations for unary hybrid hashing, which can be used for other applications such as grouping and duplicate elimination. We conclude with a discussion of techniques for caching multiple expensive methods in a single query, and raise some new optimization problems in choosing caching techniques.
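The sketch below shows only the memoization baseline the paper compares against: the result of an expensive method is cached per distinct argument value, so duplicates never recompute it. The method and data are invented stand-ins; the paper's Hybrid Cache combines this idea with unary hybrid hashing, which is not shown here.

```python
# Toy sketch of memoizing an expensive method over a column with duplicates.
# Only the baseline technique; Hybrid Cache itself is not reproduced here.

import time

def expensive_method(x):
    time.sleep(0.01)                 # stand-in for a costly user-defined method
    return x * x

def scan_with_memo(values):
    cache = {}                       # memo table keyed by argument value
    out = []
    for v in values:
        if v not in cache:           # compute once per distinct value
            cache[v] = expensive_method(v)
        out.append(cache[v])
    return out

if __name__ == "__main__":
    rows = [3, 7, 3, 3, 7, 9]
    print(scan_with_memo(rows))      # method invoked only 3 times -> [9, 49, 9, 9, 49, 81]
```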
|
|
|
Cost-based optimization for magic: algebra and implementation |
| |
Praveen Seshadri,
Joseph M. Hellerstein,
Hamid Pirahesh,
T. Y. Cliff Leung,
Raghu Ramakrishnan,
Divesh Srivastava,
Peter J. Stuckey,
S. Sudarshan
|
|
Pages: 435-446 |
|
doi>10.1145/233269.233360 |
|
Full text: PDF
|
|
Magic sets rewriting is a well-known optimization heuristic for complex decision-support queries. There can be many variants of this rewriting even for a single query, which differ greatly in execution performance. We propose cost-based techniques for selecting an efficient variant from the many choices. Our first contribution is a practical scheme that models magic sets rewriting as a special join method that can be added to any cost-based query optimizer. We derive cost formulas that allow an optimizer to choose the best variant of the rewriting and to decide whether it is beneficial. The order of complexity of the optimization process is preserved by limiting the search space in a reasonable manner. We have implemented this technique in IBM's DB2 C/S V2 database system. Our performance measurements demonstrate that the cost-based magic optimization technique performs well, and that without it, several poor decisions could be made. Our second contribution is a formal algebraic model of magic sets rewriting, based on an extension of the multiset relational algebra, which cleanly defines the search space and can be used in a rule-based optimizer. We introduce the multiset θ-semijoin operator, and derive equivalence rules involving this operator. We demonstrate that magic sets rewriting for non-recursive SQL queries can be modeled as a sequential composition of these equivalence rules.
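As a rough illustration of the rewriting whose variants are being costed, the sketch below shows the textbook flavor of magic sets for a non-recursive query: the bindings produced by the outer query block are materialized as a "magic" set and pushed into an aggregate view so the view is computed only for relevant groups. The schema and SQL are invented; this is not the paper's θ-semijoin formalization.

```python
# Toy illustration of magic sets rewriting for a non-recursive SQL query,
# shown as before/after query text. Schema and names are invented.

original = """
CREATE VIEW dept_avg AS
  SELECT dept, AVG(salary) AS avg_sal FROM emp GROUP BY dept;

SELECT d.name, v.avg_sal
FROM   dept d, dept_avg v
WHERE  d.budget > 1000000 AND v.dept = d.name;
"""

rewritten = """
-- magic set: the only departments that can contribute to the result
CREATE VIEW magic AS
  SELECT DISTINCT d.name FROM dept d WHERE d.budget > 1000000;

-- the aggregate view is computed only over the magic set
CREATE VIEW dept_avg_m AS
  SELECT e.dept, AVG(e.salary) AS avg_sal
  FROM   emp e, magic m WHERE e.dept = m.name GROUP BY e.dept;

SELECT d.name, v.avg_sal
FROM   dept d, dept_avg_m v
WHERE  d.budget > 1000000 AND v.dept = d.name;
"""

if __name__ == "__main__":
    print(original)
    print(rewritten)
```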
|
|
|
Materialized view maintenance and integrity constraint checking: trading space for time |
| |
Kenneth A. Ross,
Divesh Srivastava,
S. Sudarshan
|
|
Pages: 447-458 |
|
doi>10.1145/233269.233361 |
|
Full text: PDF
|
|
We investigate the problem of incremental maintenance of an SQL view in the face of database updates, and show that it is possible to reduce the total time cost of view maintenance by materializing (and maintaining) additional views. We formulate the problem of determining the optimal set of additional views to materialize as an optimization problem over the space of possible view sets (which includes the empty set). The optimization problem is harder than query optimization since it has to deal with multiple view sets, updates of multiple relations, and multiple ways of maintaining each view set for each updated relation. We develop a memoing solution for the problem; the solution can be implemented using the expression DAG representation used in rule-based optimizers such as Volcano. We demonstrate that global optimization cannot, in general, be achieved by locally optimizing each materialized subview, because common subexpressions between different materialized subviews can allow nonoptimal local plans to be combined into an optimal global plan. We identify conditions on materialized subviews in the expression DAG when local optimization is possible. Finally, we suggest heuristics that can be used to efficiently determine a useful set of additional views to materialize. Our results are particularly important for the efficient checking of assertions (complex integrity constraints) in the SQL-92 standard, since the incremental checking of such integrity constraints is known to be essentially equivalent to the view maintenance problem.
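The tiny sketch below illustrates the space-for-time idea in its simplest form: to maintain a three-way join view under insertions into one relation, an additional subview over the other two is materialized so each delta is joined against one stored table instead of recomputing the subjoin. Relation contents, keys, and names are invented for the example.

```python
# Toy sketch: maintain V = A JOIN B JOIN C under inserts into A by also
# materializing BC = B JOIN C. Names and data are invented.

def join(r, s, key):
    """Naive equi-join of two lists of dicts on a shared attribute."""
    return [{**x, **y} for x in r for y in s if x[key] == y[key]]

B = [{"b": 1, "c": 1}]
C = [{"c": 1, "d": 7}]
BC = join(B, C, "c")        # the additional materialized view (the space we trade)
V = []                      # the materialized top-level view A JOIN B JOIN C

def insert_into_A(delta_a):
    """Maintain V from a delta of A using the stored BC."""
    V.extend(join(delta_a, BC, "b"))

if __name__ == "__main__":
    insert_into_A([{"a": 10, "b": 1}])
    print(V)                # [{'a': 10, 'b': 1, 'c': 1, 'd': 7}]
```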
|
|
|
Maintaining database consistency in presence of value dependencies in multidatabase systems |
| |
Claire Morpain,
Michèle Cart,
Jean Ferrié,
Jean-François Pons
|
|
Pages: 459-468 |
|
doi>10.1145/233269.233362 |
|
Full text: PDF
|
|
The emergence of new criteria specifically adapted to multidatabase systems, in response to constraints imposed by global serializability, leads to restrictive hypotheses in order to ensure correctness of executions. This is the case with the two-level serializability presented in [6], which ensures strongly correct executions if transaction programs are Local Database Preserving (LDP). The main drawback of the LDP hypothesis is that it relies on rigorous programming. The principal objective of this paper is to remove this drawback while preserving the strong correctness of 2LSR executions. We propose precisely defining the notion of value dependencies, and managing them so as not to impose the LDP property.
|
|
|
Algorithms for deferred view maintenance |
| |
Latha S. Colby,
Timothy Griffin,
Leonid Libkin,
Inderpal Singh Mumick,
Howard Trickey
|
|
Pages: 469-480 |
|
doi>10.1145/233269.233364 |
|
Full text: PDF
|
|
Materialized views and view maintenance are important for data warehouses, retailing, banking, and billing applications. We consider two related view maintenance problems: 1) how to maintain views after the base tables have already been modified, and 2) how to minimize the time for which the view is inaccessible during maintenance. Typically, a view is maintained immediately, as a part of the transaction that updates the base tables. Immediate maintenance imposes a significant overhead on update transactions that cannot be tolerated in many applications. In contrast, deferred maintenance allows a view to become inconsistent with its definition. A refresh operation is used to reestablish consistency. We present new algorithms to incrementally refresh a view during deferred maintenance. Our algorithms avoid a state bug that has artificially limited techniques previously used for deferred maintenance. Incremental deferred view maintenance requires auxiliary tables that contain information recorded since the last view refresh. We present three scenarios for the use of auxiliary tables and show how these impact per-transaction overhead and view refresh time. Each scenario is described by an invariant that is required to hold in all database states. We then show that, with the proper choice of auxiliary tables, it is possible to lower both per-transaction overhead and view refresh time.
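The following sketch conveys the deferred-versus-immediate trade-off for a single-table selection view: update transactions touch only the base table and an ordered delta log (the "auxiliary table"), and a later refresh folds the log into the materialized view. The predicate, table shapes, and names are invented; this is the general idea, not the paper's algorithms.

```python
# Toy sketch of deferred view maintenance with an ordered delta log.

def predicate(row):
    return row["qty"] > 10          # assumed view definition: SELECT * WHERE qty > 10

base, view, log = [], [], []        # base table, materialized view, delta log

def update_txn(ins=(), dele=()):
    """Cheap per-transaction work: apply to the base table, log deltas, skip the view."""
    for r in ins:
        base.append(r); log.append(("ins", r))
    for r in dele:
        base.remove(r); log.append(("del", r))

def refresh():
    """Deferred refresh: replay logged deltas in order so insert/delete pairs cancel."""
    for op, r in log:
        if not predicate(r):
            continue
        if op == "ins":
            view.append(r)
        elif r in view:
            view.remove(r)
    log.clear()

if __name__ == "__main__":
    update_txn(ins=[{"id": 1, "qty": 5}, {"id": 2, "qty": 20}])
    update_txn(ins=[{"id": 3, "qty": 30}], dele=[{"id": 2, "qty": 20}])
    refresh()
    print(view)                     # [{'id': 3, 'qty': 30}]
```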
|
|
|
A framework for supporting data integration using the materialized and virtual approaches |
| |
Richard Hull,
Gang Zhou
|
|
Pages: 481-492 |
|
doi>10.1145/233269.233365 |
|
Full text: PDF
|
|
This paper presents a framework for data integration currently under development in the Squirrel project. The framework is based on a special class of mediators, called Squirrel integration mediators. These mediators can support the traditional virtual and materialized approaches, and also hybrids of them. In the Squirrel mediators, a relation in the integrated view can be supported as (a) fully materialized, (b) fully virtual, or (c) partially materialized (i.e., with some attributes materialized and other attributes virtual). In general, (partially) materialized relations of the integrated view are maintained by incremental updates from the source databases. Squirrel mediators provide two approaches for doing this: (1) materialize all needed auxiliary data, so that data sources do not have to be queried when processing the incremental updates; or (2) leave some or all of the auxiliary data virtual, and query selected source databases when processing incremental updates. The paper presents formal notions of consistency and "freshness" for integrated views defined over multiple autonomous source databases. It is shown that Squirrel mediators satisfy these properties.
|
|
|
Change detection in hierarchically structured information |
| |
Sudarshan S. Chawathe,
Anand Rajaraman,
Hector Garcia-Molina,
Jennifer Widom
|
|
Pages: 493-504 |
|
doi>10.1145/233269.233366 |
|
Full text: PDF
|
|
Detecting and representing changes to data is important for active databases, data warehousing, view maintenance, and version and configuration management. Most previous work in change management has dealt with flat-file and relational data; we focus on hierarchically structured data. Since in many cases changes must be computed from old and new versions of the data, we define the hierarchical change detection problem as the problem of finding a "minimum-cost edit script" that transforms one data tree to another, and we present efficient algorithms for computing such an edit script. Our algorithms make use of some key domain characteristics to achieve substantially better performance than previous, general-purpose algorithms. We study the performance of our algorithms both analytically and empirically, and we describe the application of our techniques to hierarchically structured documents.
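To ground the "edit script" notion, the sketch below shows a sequence of node-level operations applied to a small labeled tree. The tree encoding and the three operations are invented for illustration; the paper's contribution is finding a minimum-cost such script efficiently, which is not attempted here.

```python
# Toy sketch of applying an edit script to a labeled tree.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def find(root, label):
    """Return the first node carrying `label` (labels assumed unique here)."""
    stack = [root]
    while stack:
        n = stack.pop()
        if n.label == label:
            return n
        stack.extend(n.children)
    return None

def apply_script(root, script):
    """Apply update / insert / delete operations to the tree, in order."""
    for op in script:
        if op[0] == "update":                 # ("update", old_label, new_label)
            find(root, op[1]).label = op[2]
        elif op[0] == "insert":               # ("insert", parent_label, new_label)
            find(root, op[1]).children.append(Node(op[2]))
        elif op[0] == "delete":               # ("delete", parent_label, label)
            parent = find(root, op[1])
            parent.children = [c for c in parent.children if c.label != op[2]]

def labels(n):
    return [n.label] + [l for c in n.children for l in labels(c)]

if __name__ == "__main__":
    doc = Node("report", [Node("intro"), Node("body", [Node("fig1")])])
    apply_script(doc, [("update", "intro", "abstract"),
                       ("insert", "body", "fig2"),
                       ("delete", "body", "fig1")])
    print(labels(doc))   # ['report', 'abstract', 'body', 'fig2']
```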
|
|
|
A query language and optimization techniques for unstructured data |
| |
Peter Buneman,
Susan Davidson,
Gerd Hillebrand,
Dan Suciu
|
|
Pages: 505-516 |
|
doi>10.1145/233269.233368 |
|
Full text: PDF
|
|
A new kind of data model has recently emerged in which the database is not constrained by a conventional schema. Systems like ACeDB, which has become very popular with biologists, and the recent Tsimmis proposal for data integration organize data in tree-like structures whose components can be used equally well to represent sets and tuples. Such structures allow great flexibility in data representation. What query language is appropriate for such structures? Here we propose a simple language UnQL for querying data organized as a rooted, edge-labeled graph. In this model, relational data may be represented as fixed-depth trees, and on such trees UnQL is equivalent to the relational algebra. The novelty of UnQL consists in its programming constructs for arbitrarily deep data and for cyclic structures. While strictly more powerful than query languages with path expressions like XSQL, UnQL can still be efficiently evaluated. We describe new optimization techniques for the deep or "vertical" dimension of UnQL queries. Furthermore, we show that known optimization techniques for operators on flat relations apply to the "horizontal" dimension of UnQL.
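The sketch below mimics the data model the paper assumes: a rooted, edge-labeled tree (encoded here as nested dictionaries) and a "vertical" query that collects every subtree reachable by an edge with a given label at any depth. The encoding and function are ours for illustration, not UnQL syntax.

```python
# Toy sketch of an edge-labeled tree and a deep ("vertical") selection.

db = {
    "biblio": {
        "paper": {"title": "UnQL", "author": {"name": "Buneman"}},
        "book":  {"title": "DB",   "author": {"name": "Suciu"}},
    }
}

def select_deep(tree, label):
    """Yield every value reachable by an edge labeled `label`, at any depth."""
    if isinstance(tree, dict):
        for edge, subtree in tree.items():
            if edge == label:
                yield subtree
            yield from select_deep(subtree, label)

if __name__ == "__main__":
    print(list(select_deep(db, "name")))   # ['Buneman', 'Suciu']
```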
|
|
|
Is GUI programming a database research problem? |
| |
Nita Goyal,
Charles Hoch,
Ravi Krishnamurthy,
Brian Meckler,
Michael Suckow
|
|
Pages: 517-528 |
|
doi>10.1145/233269.233369 |
|
Full text: PDF
|
|
Programming nontrivial GUI applications is currently an arduous task. Just as the use of a declarative language simplified the programming of database applications, we ask: can we do the same for GUI programming? Can we then import a large body of knowledge from database research? We answer these questions by describing our experience in building nontrivial GUI applications, initially using C++ programming and subsequently using Logic++, a higher-order Horn clause logic language on complex objects with object-oriented features. We abstract a GUI application as a set of event handlers. Each event handler can be conceptualized as a transition from the old screen/program state to a new screen/program state. We use a data-centric view of the screen/program state (i.e., every entity on the screen corresponds to a proxy datum in the program) and express each event handler as a query-dependent update, albeit a complicated one. To express such complicated updates we use Logic++. The proxy data are expressed as derived views that are materialized on the screen. Therefore, the system must be active in maintaining these materialized views. Consequently, each event handler is conceptually an update followed by a fixpoint computation of the proxy data. Based on our experience in building the GUI system, we observe that many database techniques such as view maintenance, active DB, concurrency control, recovery, and optimization, as well as language concepts such as higher-order logic, are useful in the context of GUI programming.
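A minimal sketch of the abstraction described above: an event handler as a transition from the old program state to a new one, with the screen's "proxy data" derived from that state (conceptually, a materialized view refreshed after every event). The state shape and handler names are invented, and this is plain Python rather than the paper's Logic++.

```python
# Toy sketch: event handler = state transition + refresh of derived proxy data.

def derive_screen(state):
    """Proxy data: what the screen shows, derived from the program state."""
    return {"total_label": f"Total: {sum(state['items'])}"}

def on_add_item(state, value):
    """Event handler: compute the new state, then refresh the 'view' of it."""
    new_state = {**state, "items": state["items"] + [value]}
    return new_state, derive_screen(new_state)

if __name__ == "__main__":
    state = {"items": [2, 3]}
    state, screen = on_add_item(state, 5)
    print(screen)            # {'total_label': 'Total: 10'}
```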
|
|
|
Accessing relational databases from the World Wide Web |
| |
Tam Nguyen,
V. Srinivasan
|
|
Pages: 529-540 |
|
doi>10.1145/233269.233371 |
|
Full text: PDF
|
|
With the growing popularity of the Internet and the World Wide Web (Web), there is a fast-growing demand for access to database management systems (DBMS) from the Web. We describe here techniques that we invented to bridge the gap between HTML, the standard markup language of the Web, and SQL, the standard query language used to access relational DBMS. We propose a flexible, general-purpose variable substitution mechanism that provides cross-language variable substitution between HTML input and SQL query strings, as well as between SQL result rows and HTML output, thus enabling the application developer to use the full capabilities of HTML for the creation of query forms and reports, and SQL for queries and updates. The cross-language variable substitution mechanism has been used in the design and implementation of a system called DB2 WWW Connection that enables quick and easy construction of applications that access relational DBMS data from the Web. An end user of these DB2 WWW applications sees only the forms for his or her requests and the resulting reports. A user fills out the forms, points and clicks to navigate the forms and to access the database as determined by the application.
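The sketch below conveys the cross-language substitution idea: the same named variables are substituted into an SQL query string (from HTML form input) and into an HTML report template (from SQL result rows). The `$(name)` delimiter, templates, and data are invented for illustration; they are not DB2 WWW Connection's actual syntax.

```python
# Toy sketch of cross-language variable substitution between HTML and SQL.

import re

def substitute(template, bindings):
    """Replace $(name) placeholders with the bound values."""
    return re.sub(r"\$\((\w+)\)", lambda m: str(bindings[m.group(1)]), template)

SQL_TEMPLATE = "SELECT name, salary FROM emp WHERE dept = '$(dept)'"
ROW_TEMPLATE = "<tr><td>$(name)</td><td>$(salary)</td></tr>"

if __name__ == "__main__":
    form_input = {"dept": "sales"}                       # from the HTML form
    print(substitute(SQL_TEMPLATE, form_input))
    fake_rows = [{"name": "Ann", "salary": 50000}]       # stand-in query result
    print("".join(substitute(ROW_TEMPLATE, r) for r in fake_rows))
```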
|
|
|
The ins and outs (and everything in between) of data warehousing |
| |
Phil Fernandez,
Donovan Schneider
|
|
Page: 541 |
|
doi>10.1145/233269.280347 |
|
Full text: PDF
|
|
|
|
|
Repository system engineering |
| |
Philip A. Bernstein
|
|
Page: 542 |
|
doi>10.1145/233269.280348 |
|
Full text: PDF
|
|
|
|
|
Databases and visualization |
| |
Daniel A. Keim
|
|
Page: 543 |
|
doi>10.1145/233269.280349 |
|
Full text: PDF
|
|
|
|
|
State of the art in workflow management research and products |
| |
C. Mohan
|
|
Page: 544 |
|
doi>10.1145/233269.280350 |
|
Full text: PDF
|
|
In the last few years, workflow management has become a hot topic in the research community and, especially, in the commercial arena. Workflow management is multidisciplinary in nature, encompassing many aspects of computing: database management, distributed client-server systems, transaction management, mobile computing, business process reengineering, integration of legacy and new applications, and heterogeneity of hardware and software. Many academic and industrial research projects are underway. Numerous successful products have been released. Standardization efforts are in progress under the auspices of the Workflow Management Coalition. As has happened with some topics in the RDBMS area, some of the important real-life problems faced by customers and product developers in the workflow area are not being tackled by researchers. This tutorial will survey the state of the art in workflow management research and products.
|
|
|
Data mining techniques |
| |
Jiawei Han
|
|
Page: 545 |
|
doi>10.1145/233269.280351 |
|
Full text: PDF
|
|
Data mining, or knowledge discovery in databases, has been popularly recognized as an important research issue with broad applications. We provide a comprehensive survey, from a database perspective, of recently developed data mining techniques. Several major kinds of data mining methods, including generalization, characterization, classification, clustering, association, evolution, pattern matching, data visualization, and meta-rule guided mining, will be reviewed. Techniques for mining knowledge in different kinds of databases, including relational, transaction, object-oriented, spatial, and active databases, as well as global information systems, will be examined. Potential data mining applications and some research issues will also be discussed.
|
|
|
Thinksheet: a tool for tailoring complex documents |
| |
Peter Piatko,
Roman Yangarber,
Daoi Lin,
Dennis Shasha
|
|
Page: 546 |
|
doi>10.1145/233269.280352 |
|
Full text: PDF
|
|
|
|
|
HyperStorM—administering structured documents using object-oriented database technology |
| |
Klemens Böhm,
Karl Aberer
|
|
Page: 547 |
|
doi>10.1145/233269.280353 |
|
Full text: PDF
|
|
|
|
|
DBSim: a simulation tool for predicting database performance |
| |
Mark Lefler,
Mark Stokrp,
Craig Wong
|
|
Page: 548 |
|
doi>10.1145/233269.280354 |
|
Full text: PDF
|
|
|
|
|
LORE: a Lightweight Object REpository for semistructured data |
| |
Dallan Quass,
Jennifer Widom,
Roy Goldman,
Kevin Haas,
Qingshan Luo,
Jason McHugh,
Svetlozar Nestorov,
Anand Rajaraman,
Hugo Rivero,
Serge Abiteboul,
Jeff Ullman,
Janet Wiener
|
|
Page: 549 |
|
doi>10.1145/233269.280355 |
|
Full text: PDF
|
|
|
|
|
DBMiner: interactive mining of multiple-level knowledge in relational databases |
| |
Jiawei Han,
Yongjian Fu,
Wei Wang,
Jenny Chiang,
Osmar R. Zaïane,
Krzysztof Koperski
|
|
Page: 550 |
|
doi>10.1145/233269.280356 |
|
Full text: PDF
|
|
Based on our years of research, a data mining system, DBMiner, has been developed for interactive mining of multiple-level knowledge in large relational databases. The system implements a wide spectrum of data mining functions, including generalization, characterization, association, classification, and prediction. By incorporating several interesting data mining techniques, including attribute-oriented induction, progressive deepening for mining multiple-level rules, and meta-rule guided knowledge mining, the system provides a user-friendly, interactive data mining environment with good performance.
|
|
|
Prospector: a content-based multimedia server for massively parallel architectures |
| |
S. Choo,
W. O'Connell,
G. Linerman,
H. Chen,
K. Ganapathy,
A. Biliris,
E. Panagos,
D. Schrader
|
|
Page: 551 |
|
doi>10.1145/233269.280357 |
|
Full text: PDF
|
|
The Prospector Multimedia Object Manager prototype is a general-purpose content analysis multimedia server designed for massively parallel processor environments. Prospector defines and manipulates user-defined functions, which are invoked in parallel to analyze and manipulate the contents of multimedia objects. Several computationally intensive applications of this technology based on large persistent datasets include: fingerprint matching, signature verification, face recognition, and speech recognition/translation [OIS96].
|
|
|
METU interoperable database system |
| |
Asuman Dogac,
Ugur Halici,
Ebru Kilic,
Gokhan Ozhan,
Fatma Ozcan,
Sena Nural,
Cevdet Dengi,
Sema Mancuhan,
Budak Arpinar,
Pinar Koksal,
Cem Evrendilek
|
|
Page: 552 |
|
doi>10.1145/233269.280358 |
|
Full text: PDF
|
|
|
|
|
SONAR: system for optimized numeric association rules |
| |
Takeshi Fukuda,
Yasuhiko Morimoto,
Shinichi Morishita,
Takeshi Tokuyama
|
|
Page: 553 |
|
doi>10.1145/233269.280359 |
|
Full text: PDF
|
|
|
|
|
CapBasED-AMS: a capability-based and event-driven activity management system |
| |
Patrick C. K. Hung,
Helen P. Yeung,
Kamalakar Karlapalem
|
|
Page: 554 |
|
doi>10.1145/233269.280360 |
|
Full text: PDF
|
|
|
|
|
The MultiView project: object-oriented view technology and applications |
| |
E. A. Rundensteiner,
H. A. Kuno,
Y.-G. Ra,
V. Crestana-Taube,
M. C. Jones,
P. J. Marron
|
|
Page: 555 |
|
doi>10.1145/233269.280361 |
|
Full text: PDF
|
|
|
|
|
BeSS: storage support for interactive visualization systems |
| |
A. Biliris,
T. A. Funkhouser,
W. O'Connell,
E. Panagos
|
|
Page: 556 |
|
doi>10.1145/233269.280362 |
|
Full text: PDF
|
|
|
|
|
The Garlic project |
| |
M. Tork Roth,
M. Arya,
L. Haas,
M. Carey,
W. Cody,
R. Fagin,
P. Schwarz,
J. Thomas,
E. Wimmers
|
|
Page: 557 |
|
doi>10.1145/233269.280363 |
|
Full text: PDF
|
|
The goal of the Garlic [1] project is to build a multimedia information system capable of integrating data that resides in different database systems as well as in a variety of non-database data servers. This integration must be enabled while maintaining the independence of the data servers, and without creating copies of their data. "Multimedia" should be interpreted broadly to mean not only images, video, and audio, but also text and application-specific data types (e.g., CAD drawings, medical objects, …). Since much of this data is naturally modeled by objects, Garlic provides an object-oriented schema to applications, interprets object queries, creates execution plans for sending pieces of queries to the appropriate data servers, and assembles query results for delivery back to the applications. A significant focus of the project is support for "intelligent" data servers, i.e., servers that provide media-specific indexing and query capabilities [2]. Database optimization technology is being extended to deal with heterogeneous collections of data servers so that efficient data access plans can be employed for multi-repository queries. A prototype of the Garlic system has been operational since January 1995. Queries are expressed in an SQL-like query language that has been extended to include object-oriented features such as reference-valued attributes and nested sets. In addition to a C++ API, Garlic supports a novel query/browser interface called PESTO [3]. This component of Garlic provides end users of the system with a friendly, graphical interface that supports interactive browsing, navigation, and querying of the contents of Garlic databases. Unlike existing interfaces to databases, PESTO allows users to move back and forth seamlessly between querying and browsing activities, using queries to identify interesting subsets of the database, browsing the subset, querying the content of a set-valued attribute of a particularly interesting object in the subset, and so on.
|