SESSION: Keynote talks

The next database revolution
Jim Gray
Pages: 1-4
DOI: 10.1145/1007568.1007570

Database system architectures are undergoing revolutionary changes. Most importantly, algorithms and data are being unified by integrating programming languages with the database system. This gives an extensible object-relational system where non-procedural relational operators manipulate object sets. Coupled with this, each DBMS is now a web service. This has huge implications for how we structure applications. DBMSs are now object containers. Queues are the first objects to be added. These queues are the basis for transaction processing and workflow applications. Future workflow systems are likely to be built on this core. Data cubes and online analytic processing are now baked into most DBMSs. Beyond that, DBMSs have a framework for data mining and machine learning algorithms. Decision trees, Bayes nets, clustering, and time series analysis are built in; new algorithms can be added. There is a rebirth of column stores for sparse tables and to optimize bandwidth. Text, temporal, and spatial data access methods, along with their probabilistic reasoning, have been added to database systems. Allowing approximate and probabilistic answers is essential for many applications. Many believe that XML and XQuery will be the main data structure and access pattern. Database systems must accommodate that perspective. External data increasingly arrives as streams to be compared to historical data; so stream-processing operators are being added to the DBMS. Publish-subscribe systems invert the data-query ratios; incoming data is compared against millions of queries rather than queries searching millions of records. Meanwhile, disk and memory capacities are growing much faster than their bandwidth and latency, so database systems increasingly use huge main memories and sequential disk access. These changes mandate a much more dynamic query optimization strategy - one that adapts to current conditions and selectivities rather than having a static plan. Intelligence is moving to the periphery of the network. Each disk and each sensor will be a competent database machine. Relational algebra is a convenient way to program these systems. Database systems are now expected to be self-managing, self-healing, and always-up. We researchers and developers have our work cut out for us in delivering all these features.

The role of cryptography in database security
Ueli Maurer
Pages: 5-10
DOI: 10.1145/1007568.1007571

In traditional database security research, the database is usually assumed to be trustworthy. Under this assumption, the goal is to achieve security against external attacks (e.g. from hackers) and possibly also against users trying to obtain information beyond their privileges, for instance by some type of statistical inference. However, for many database applications such as health information systems there exist conflicting interests of the database owner and the users or organizations interacting with the database, and also between the users. Therefore the database cannot necessarily be assumed to be fully trusted. In this extended abstract we address the problem of defining and achieving security in a context where the database is not fully trusted, i.e., when the users must be protected against a potentially malicious database. Moreover, we address the problem of the secure aggregation of databases owned by mutually mistrusting organisations, for example by competing companies.

SESSION: Research sessions: stream management

Adaptive stream resource management using Kalman Filters
Ankur Jain, Edward Y. Chang, Yuan-Fang Wang
Pages: 11-22
DOI: 10.1145/1007568.1007573

To answer user queries efficiently, a stream management system must handle continuous, high-volume, possibly noisy, and time-varying data streams. One major research area in stream management seeks to allocate resources (such as network bandwidth and memory) to query plans, either to minimize resource usage under a precision requirement, or to maximize precision of results under resource constraints. To date, many solutions have been proposed; however, most solutions are ad hoc with hard-coded heuristics to generate query plans. In contrast, we perceive stream resource management as fundamentally a filtering problem, in which the objective is to filter out as much data as possible to conserve resources, provided that the precision standards can be met. We select the Kalman Filter as a general and adaptive filtering solution for conserving resources. The Kalman Filter has the ability to adapt to various stream characteristics, sensor noise, and time variance. Furthermore, we realize a significant performance boost by switching from traditional methods of caching static data (which can soon become stale) to our method of caching dynamic procedures that can predict data reliably at the server without the clients' involvement. In this work we focus on minimization of communication overhead for both synthetic and real-world streams. Through examples and empirical studies, we demonstrate the flexibility and effectiveness of using the Kalman Filter as a solution for managing trade-offs between precision of results and resources in satisfying stream queries.
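
The prediction-driven update suppression described in the abstract can be illustrated with a small sketch: a scalar random-walk Kalman filter mirrored at the sensor client and at the server, where the client ships a reading only when the shared prediction drifts outside a precision bound. This is a minimal sketch in the spirit of the abstract, not the paper's system; the noise parameters, the bound, and the synthetic sensor drift are all assumptions.

```python
# Minimal sketch (not the paper's implementation): a scalar random-walk Kalman
# filter shared by a sensor client and the server. The client transmits a
# reading only when the prediction violates the precision bound, which is the
# communication-suppression idea; noise parameters and the bound are illustrative.

class ScalarKalman:
    def __init__(self, x0, process_var=1e-2, measure_var=1e-1):
        self.x = x0          # state estimate
        self.p = 1.0         # estimate variance
        self.q = process_var # process (model) noise
        self.r = measure_var # measurement noise

    def predict(self):
        # Random-walk model: state carries over, uncertainty grows.
        self.p += self.q
        return self.x

    def correct(self, z):
        # Standard Kalman update with gain k.
        k = self.p / (self.p + self.r)
        self.x += k * (z - self.x)
        self.p *= (1 - k)

def stream_with_suppression(readings, bound=0.5):
    """Return how many readings the client had to transmit; both sides keep
    identical filters so the server can answer queries from predictions."""
    client = ScalarKalman(readings[0])
    server = ScalarKalman(readings[0])   # server mirrors the client's filter
    sent = 0
    for z in readings[1:]:
        pred = client.predict()
        server.predict()
        if abs(z - pred) > bound:
            client.correct(z)
            server.correct(z)            # the update is shipped to the server
            sent += 1
        # otherwise nothing is sent; both sides keep using the prediction
    return sent

if __name__ == "__main__":
    import random
    random.seed(0)
    data = [20.0]
    for _ in range(200):                 # slowly drifting, noisy sensor
        data.append(data[-1] + random.gauss(0.02, 0.05))
    print("updates sent:", stream_with_suppression(data), "of", len(data) - 1)
```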

Online event-driven subsequence matching over financial data streams
Huanmei Wu, Betty Salzberg, Donghui Zhang
Pages: 23-34
DOI: 10.1145/1007568.1007574

Subsequence similarity matching in time series databases is an important research area for many applications. This paper presents a new approximate approach for automatic online subsequence similarity matching over massive data streams. With a simultaneous on-line segmentation and pruning algorithm over the incoming stream, the resulting piecewise linear representation of the data stream features high sensitivity and accuracy. The similarity definition is based on a permutation followed by a metric distance function, which provides the similarity search with flexibility, sensitivity and scalability. Also, the metric-based indexing methods can be applied for speed-up. To reduce the system burden, the event-driven similarity search is performed only when there is a potential event. The query sequence is the most recent subsequence of piecewise data representation of the incoming stream which is automatically generated by the system. The retrieved results can be analyzed in different ways according to the requirements of specific applications. This paper discusses an application for future data movement prediction based on statistical information. Experiments on real stock data are performed. The correctness of trend predictions is used to evaluate the performance of subsequence similarity matching.
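
For readers unfamiliar with piecewise linear stream representations, the sketch below shows a standard sliding-window segmentation that closes a segment once the least-squares fit residual exceeds an error budget. It is not the paper's simultaneous segmentation-and-pruning algorithm, nor its permutation-based similarity measure; the error threshold and the sample prices are illustrative.

```python
# Minimal sketch of online sliding-window piecewise linear segmentation, one
# standard way to build the kind of representation the abstract relies on.

def fit_line(points):
    """Least-squares line through (i, y) points; returns slope, intercept."""
    n = len(points)
    xs = list(range(n))
    mx = sum(xs) / n
    my = sum(points) / n
    denom = sum((x - mx) ** 2 for x in xs) or 1.0
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, points)) / denom
    return slope, my - slope * mx

def max_residual(points):
    slope, icept = fit_line(points)
    return max(abs(y - (slope * i + icept)) for i, y in enumerate(points))

def segment_stream(stream, max_error=1.0):
    """Consume a value stream and emit (start_index, values) segments online."""
    buf, start, out = [], 0, []
    for i, y in enumerate(stream):
        buf.append(y)
        if len(buf) > 2 and max_residual(buf) > max_error:
            out.append((start, buf[:-1]))   # close the segment before the breaker
            buf, start = [y], i
    if buf:
        out.append((start, buf))
    return out

if __name__ == "__main__":
    prices = [10, 10.5, 11, 11.6, 12, 9, 8.5, 8, 7.4, 7, 10, 11, 12]
    for start, seg in segment_stream(prices, max_error=0.5):
        print(f"segment starting at {start}: {seg}")
```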

Holistic UDAFs at streaming speeds
Graham Cormode, Theodore Johnson, Flip Korn, S. Muthukrishnan, Oliver Spatscheck, Divesh Srivastava
Pages: 35-46
DOI: 10.1145/1007568.1007575

Many algorithms have been proposed to approximate holistic aggregates, such as quantiles and heavy hitters, over data streams. However, little work has been done to explore what techniques are required to incorporate these algorithms in a data stream query processor, and to make them useful in practice. In this paper, we study the performance implications of using user-defined aggregate functions (UDAFs) to incorporate selection-based and sketch-based algorithms for holistic aggregates into a data stream management system's query processing architecture. We identify key performance bottlenecks and tradeoffs, and propose novel techniques to make these holistic UDAFs fast and space-efficient for use in high-speed data stream applications. We evaluate performance using generated and actual IP packet data, focusing on approximating quantiles and heavy hitters. The best of our current implementations can process streaming queries at OC48 speeds (2x 2.4Gbps).
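
As background on counter-based heavy-hitter summaries, here is a Misra-Gries summary wrapped in a UDAF-style interface (accumulate, merge, terminate). The interface names are hypothetical and the algorithm is one standard representative of the counter-based class studied in this area, not the authors' implementation.

```python
# Minimal sketch of a heavy-hitter UDAF. The summary is Misra-Gries with at
# most k-1 counters, so each estimated count is within n/k of the true count.
# The accumulate/merge/terminate names are illustrative, not the paper's API.

class HeavyHitterUDAF:
    def __init__(self, k=100):
        self.k = k
        self.counters = {}

    def accumulate(self, item):
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k - 1:
            self.counters[item] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def merge(self, other):
        # Combine partial aggregates (e.g. from parallel sub-streams).
        for item, cnt in other.counters.items():
            self.counters[item] = self.counters.get(item, 0) + cnt
        if len(self.counters) >= self.k:
            # Re-prune by subtracting the k-th largest count.
            cut = sorted(self.counters.values(), reverse=True)[self.k - 1]
            self.counters = {i: c - cut for i, c in self.counters.items() if c > cut}

    def terminate(self, threshold):
        # Report candidates whose estimated count reaches the threshold.
        return {i: c for i, c in self.counters.items() if c >= threshold}

if __name__ == "__main__":
    udaf = HeavyHitterUDAF(k=10)
    stream = ["10.0.0.1"] * 500 + ["10.0.0.2"] * 300 + [f"h{i}" for i in range(200)]
    for packet_src in stream:
        udaf.accumulate(packet_src)
    print(udaf.terminate(threshold=100))
```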

SESSION: Research sessions: XML query efficiency

BLAS: an efficient XPath processing system
Yi Chen, Susan B. Davidson, Yifeng Zheng
Pages: 47-58
DOI: 10.1145/1007568.1007577

We present BLAS, a Bi-LAbeling based System, for efficiently processing complex XPath queries over XML data. BLAS uses P-labeling to process queries involving consecutive child axes, and D-labeling to process queries involving descendant axes traversal. The XML data is stored in labeled form, and indexed to optimize descendant axis traversals. Three algorithms are presented for translating complex XPath queries to SQL expressions, and two alternate query engines are provided. Experimental results demonstrate that the BLAS system has a substantial performance improvement compared to traditional XPath processing using D-labeling.
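
Interval (start, end) labeling is the classic way to turn descendant-axis checks into a containment test, and D-labeling belongs to this family. The sketch below shows only that idea, not BLAS's actual labels, its P-labeling for consecutive child axes, or the SQL translation; the sample document and element names are illustrative.

```python
# Minimal sketch of interval-based labeling: each element gets a (start, end)
# pair from a DFS counter, and "a is an ancestor of d" becomes interval
# containment, which is what makes descendant-axis joins index-friendly.

import xml.etree.ElementTree as ET

def label(root):
    labels, counter = [], [0]
    def visit(node):
        counter[0] += 1
        start = counter[0]
        for child in node:
            visit(child)
        counter[0] += 1
        labels.append((start, counter[0], node.tag))
    visit(root)
    return labels

def descendant_pairs(labels, ancestor_tag, descendant_tag):
    """All label pairs satisfying ancestor_tag//descendant_tag."""
    anc = [(s, e) for s, e, t in labels if t == ancestor_tag]
    des = [(s, e) for s, e, t in labels if t == descendant_tag]
    return [((a_s, a_e), (d_s, d_e))
            for a_s, a_e in anc
            for d_s, d_e in des
            if a_s < d_s and d_e < a_e]      # interval containment test

if __name__ == "__main__":
    doc = ET.fromstring(
        "<lib><book><title>t1</title><author>a1</author></book>"
        "<journal><title>t2</title></journal></lib>")
    labs = label(doc)
    print(len(descendant_pairs(labs, "book", "title")), "match(es) for book//title")
```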

Efficient processing of XML twig queries with OR-predicates
Haifeng Jiang, Hongjun Lu, Wei Wang
Pages: 59-70
DOI: 10.1145/1007568.1007578

An XML twig query, represented as a labeled tree, is essentially a complex selection predicate on both structure and content of an XML document. Twig query matching has been identified as a core operation in querying tree-structured XML data. A number of algorithms have been proposed recently to process a twig query holistically. Those algorithms, however, only deal with twig queries without OR-predicates. A straightforward approach that first decomposes a twig query with OR-predicates into multiple twig queries without OR-predicates and then combines their results is obviously not optimal in most cases. In this paper, we study novel holistic-processing algorithms for twig queries with OR-predicates without decomposition. In particular, we present a merge-based algorithm for sorted XML data and an index-based algorithm for indexed XML data. We show that holistic processing is much more efficient than the decomposition approach. Furthermore, we show that using indexes can significantly improve the performance for matching twig queries with OR-predicates, especially when the queries have large inputs but relatively small outputs.

Tree logical classes for efficient evaluation of XQuery
Stelios Paparizos, Yuqing Wu, Laks V. S. Lakshmanan, H. V. Jagadish
Pages: 71-82
DOI: 10.1145/1007568.1007579

XML is widely praised for its flexibility in allowing repeated and missing sub-elements. However, this flexibility makes it challenging to develop a bulk algebra, which typically manipulates sets of objects with identical structure. A set of XML elements, say of type book, may have members that vary greatly in structure, e.g. in the number of author sub-elements. This kind of heterogeneity may permeate the entire document in a recursive fashion: e.g., different authors of the same or different book may in turn greatly vary in structure. Even when the document conforms to a schema, the flexible nature of schemas for XML still allows such significant variations in structure among elements in a collection. Bulk processing of such heterogeneous sets is problematic. In this paper, we introduce the notion of logical classes (LC) of pattern tree nodes, and generalize the notion of pattern tree matching to handle node logical classes. This abstraction pays off significantly in allowing us to reason with an inherently heterogeneous collection of elements in a uniform, homogeneous way. Based on this, we define a Tree Logical Class (TLC) algebra that is capable of handling the heterogeneity arising in XML query processing, while avoiding redundant work. We present an algorithm to obtain a TLC algebra expression from an XQuery statement (for a large fragment of XQuery). We show how to implement the TLC algebra efficiently, introducing the nest-join as an important physical operator for XML query processing. We show that evaluation plans generated using the TLC algebra not only are simpler but also perform better than those generated by competing approaches. TLC is the algebra used in the Timber [8] system developed at the University of Michigan.

SESSION: Research sessions: Web, XML and IR

FleXPath: flexible structure and full-text querying for XML
Sihem Amer-Yahia, Laks V. S. Lakshmanan, Shashank Pandit
Pages: 83-94
DOI: 10.1145/1007568.1007581

Querying XML data is a well-explored topic with powerful database-style query languages such as XPath and XQuery set to become W3C standards. An equally compelling paradigm for querying XML documents is full-text search on textual content. In this paper, we study fundamental challenges that arise when we try to integrate these two querying paradigms. While keyword search is based on approximate matching, XPath has exact match semantics. We address this mismatch by considering queries on structure as a "template", and looking for answers that best match this template and the full-text search. To achieve this, we provide an elegant definition of relaxation on structure and define primitive operators to span the space of relaxations. Query answering is now based on ranking potential answers on structural and full-text search conditions. We set out certain desirable principles for ranking schemes and propose natural ranking schemes that adhere to these principles. We develop efficient algorithms for answering top-K queries and discuss results from a comprehensive set of experiments that demonstrate the utility and scalability of the proposed framework and algorithms.

An interactive clustering-based approach to integrating source query interfaces on the deep Web
Wensheng Wu, Clement Yu, AnHai Doan, Weiyi Meng
Pages: 95-106
DOI: 10.1145/1007568.1007582

An increasing number of data sources are now becoming available on the Web, but often their contents are only accessible through query interfaces. For a domain of interest, there often exist many such sources with varied coverage or querying capabilities. As an important step to the integration of these sources, we consider the integration of their query interfaces. More specifically, we focus on the crucial step of the integration: accurately matching the interfaces. While the integration of query interfaces has received more attention recently, current approaches are not sufficiently general: (a) they all model interfaces with flat schemas; (b) most of them only consider 1:1 mappings of fields over the interfaces; (c) they all perform the integration in a blackbox-like fashion and the whole process has to be restarted from scratch if anything goes wrong; and (d) they often require laborious parameter tuning. In this paper, we propose an interactive, clustering-based approach to matching query interfaces. The hierarchical nature of interfaces is captured with ordered trees. Varied types of complex mappings of fields are examined and several approaches are proposed to effectively identify these mappings. We put the human integrator back in the loop and propose several novel approaches to the interactive learning of parameters and the resolution of uncertain mappings. Extensive experiments are conducted and results show that our approach is highly effective.

Understanding Web query interfaces: best-effort parsing with hidden syntax
Zhen Zhang, Bin He, Kevin Chen-Chuan Chang
Pages: 107-118
DOI: 10.1145/1007568.1007583

Recently, the Web has been rapidly "deepened" by many searchable databases online, where data are hidden behind query forms. For modelling and integrating Web databases, the very first challenge is to understand what a query interface says - or what query capabilities a source supports. Such automatic extraction of interface semantics is challenging, as query forms are created autonomously. Our approach builds on the observation that, across myriad sources, query forms seem to reveal some "concerted structure," by sharing common building blocks. Toward this insight, we hypothesize the existence of a hidden syntax that guides the creation of query interfaces, albeit from different sources. This hypothesis effectively transforms query interfaces into a visual language with a non-prescribed grammar - and, thus, their semantic understanding a parsing problem. Such a paradigm enables principled solutions for both declaratively representing common patterns, by a derived grammar, and systematically interpreting query forms, by a global parsing mechanism. To realize this paradigm, we must address the challenges of a hypothetical syntax - that it is to be derived, and that it is secondary to the input. At the heart of our form extractor, we thus develop a 2P grammar and a best-effort parser, which together realize a parsing mechanism for a hypothetical syntax. Our experiments show the promise of this approach - it achieves above 85% accuracy for extracting query conditions across random sources.

Using the structure of Web sites for automatic segmentation of tables
Kristina Lerman, Lise Getoor, Steven Minton, Craig Knoblock
Pages: 119-130
DOI: 10.1145/1007568.1007584

Many Web sites, especially those that dynamically generate HTML pages to display the results of a user's query, present information in the form of lists or tables. Current tools that allow applications to programmatically extract this information rely heavily on user input, often in the form of labeled extracted records. The sheer size and rate of growth of the Web make any solution that relies primarily on user input infeasible in the long term. Fortunately, many Web sites contain much explicit and implicit structure, both in layout and content, that we can exploit for the purpose of information extraction. This paper describes an approach to automatic extraction and segmentation of records from Web tables. Automatic methods do not require any user input, but rely solely on the layout and content of the Web source. Our approach relies on the common structure of many Web sites, which present information as a list or a table, with a link in each entry leading to a detail page containing additional information about that item. We describe two algorithms that use redundancies in the content of table and detail pages to aid in information extraction. The first algorithm encodes additional information provided by detail pages as constraints and finds the segmentation by solving a constraint satisfaction problem. The second algorithm uses probabilistic inference to find the record segmentation. We show how each approach can exploit the Web site structure in a general, domain-independent manner, and we demonstrate the effectiveness of each algorithm on a set of twelve Web sites.

SESSION: Research sessions: data mining applications

Identifying similarities, periodicities and bursts for online search queries
Michail Vlachos, Christopher Meek, Zografoula Vagena, Dimitrios Gunopulos
Pages: 131-142
DOI: 10.1145/1007568.1007586

We present several methods for mining knowledge from the query logs of the MSN search engine. Using the query logs, we build a time series for each query word or phrase (e.g., 'Thanksgiving' or 'Christmas gifts') where the elements of the time series are the number of times that a query is issued on a day. All of the methods we describe use sequences of this form and can be applied to time series data generally. Our primary goal is the discovery of semantically similar queries and we do so by identifying queries with similar demand patterns. Utilizing the best Fourier coefficients and the energy of the omitted components, we improve upon the state-of-the-art in time-series similarity matching. The extracted sequence features are then organized in an efficient metric tree index structure. We also demonstrate how to efficiently and accurately discover the important periods in a time-series. Finally we propose a simple but effective method for identification of bursts (long or short-term). Using the burst information extracted from a sequence, we are able to efficiently perform 'query-by-burst' on the database of time-series. We conclude the presentation with the description of a tool that uses the described methods, and serves as an interactive exploratory data discovery tool for the MSN query database.
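
The Fourier-based summarization can be illustrated with a simplified sketch that keeps the first k DFT coefficients of each series and uses the energy of the omitted coefficients to tighten the usual lower bound on Euclidean distance. The paper keeps the best coefficients per series, which requires a more careful bound than the one below; the synthetic demand curves are illustrative.

```python
# Minimal sketch: per-series Fourier summaries plus an omitted-energy
# correction. By Parseval (orthonormal FFT), the true distance equals the
# distance over all coefficients; the dropped part is bounded below by
# (sqrt(ex) - sqrt(ey))^2, so the value printed is always a valid lower bound.

import numpy as np

def summarize(x, k):
    """Keep the first k DFT coefficients and the energy that was dropped."""
    coeffs = np.fft.fft(np.asarray(x, dtype=float), norm="ortho")
    return coeffs[:k], float(np.sum(np.abs(coeffs[k:]) ** 2))

def lower_bound(sum_x, sum_y):
    (kx, ex), (ky, ey) = sum_x, sum_y
    kept_part = float(np.sum(np.abs(kx - ky) ** 2))
    omitted_part = (np.sqrt(ex) - np.sqrt(ey)) ** 2
    return float(np.sqrt(kept_part + omitted_part))

if __name__ == "__main__":
    t = np.arange(365)
    demand_a = np.sin(2 * np.pi * t / 365) + 0.1 * np.random.randn(365)
    demand_b = np.sin(2 * np.pi * t / 365 + 0.3) + 0.1 * np.random.randn(365)
    sa, sb = summarize(demand_a, 8), summarize(demand_b, 8)
    lb = lower_bound(sa, sb)
    true = float(np.linalg.norm(demand_a - demand_b))
    print(f"lower bound {lb:.2f} <= true distance {true:.2f}")
```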

FARMER: finding interesting rule groups in microarray datasets
Gao Cong, Anthony K. H. Tung, Xin Xu, Feng Pan, Jiong Yang
Pages: 143-154
DOI: 10.1145/1007568.1007587

Microarray datasets typically contain a large number of columns but a small number of rows. Association rules have been proved to be useful in analyzing such datasets. However, most existing association rule mining algorithms are unable to efficiently handle datasets with a large number of columns. Moreover, the number of association rules generated from such datasets is enormous due to the large number of possible column combinations. In this paper, we describe a new algorithm called FARMER that is specially designed to discover association rules from microarray datasets. Instead of finding individual association rules, FARMER finds interesting rule groups which are essentially a set of rules that are generated from the same set of rows. Unlike conventional rule mining algorithms, FARMER searches for interesting rules in the row enumeration space and exploits all user-specified constraints including minimum support, confidence and chi-square to support efficient pruning. Several experiments on real bioinformatics datasets show that FARMER is orders of magnitude faster than previous association rule mining algorithms.

Diamond in the rough: finding Hierarchical Heavy Hitters in multi-dimensional data
Graham Cormode, Flip Korn, S. Muthukrishnan, Divesh Srivastava
Pages: 155-166
DOI: 10.1145/1007568.1007588

Data items archived in data warehouses or those that arrive online as streams typically have attributes which take values from multiple hierarchies (e.g., time and geographic location; source and destination IP addresses). Providing an aggregate view of such data is important to summarize, visualize, and analyze. We develop the aggregate view based on certain hierarchically organized sets of large-valued regions ("heavy hitters"). Such Hierarchical Heavy Hitters (HHHs) were previously introduced as a crucial aggregation technique in one dimension. In order to analyze the wider range of data warehousing applications and realistic IP data streams, we generalize this problem to multiple dimensions. We identify and study two variants of HHHs for multi-dimensional data, namely the "overlap" and "split" cases, depending on how an aggregate computed for a child node in the multi-dimensional hierarchy is propagated to its parent element(s). For data warehousing applications, we present offline algorithms that take multiple passes over the data and produce the exact HHHs. For data stream applications, we present online algorithms that find approximate HHHs in one pass, with provable accuracy guarantees. We show experimentally, using real and synthetic data, that our proposed online algorithms yield outputs which are very similar (virtually identical, in many cases) to their offline counterparts. The lattice property of the product of hierarchical dimensions ("diamond") is crucially exploited in our online algorithms to track approximate HHHs using only a small, fixed number of statistics per candidate node, regardless of the number of dimensions.
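
To make the HHH definition concrete, the sketch below computes exact, offline Hierarchical Heavy Hitters in one dimension (IPv4 prefixes), discounting counts already claimed by more specific HHHs. The paper's contribution, the multi-dimensional overlap/split variants and the one-pass approximate algorithms with accuracy guarantees, is not attempted here; phi and the traffic mix are illustrative.

```python
# Minimal sketch of exact, offline, one-dimensional HHHs over IPv4 prefixes:
# sweep the prefix hierarchy bottom-up, report any node whose uncovered count
# reaches phi*n, and stop that count from propagating to its ancestors.

from collections import Counter, defaultdict

def parent(prefix):
    """Drop the last octet: '10.0.0.1' -> '10.0.0' -> ... -> '' (the root)."""
    return prefix.rsplit(".", 1)[0] if "." in prefix else ""

def hierarchical_heavy_hitters(addresses, phi=0.1):
    n = len(addresses)
    threshold = phi * n
    uncovered = defaultdict(float, Counter(addresses))  # exact leaf counts
    hhhs = {}
    for depth in (4, 3, 2, 1):                # /32 up to /8, one level at a time
        level = [p for p in list(uncovered) if p.count(".") == depth - 1]
        for node in level:
            if uncovered[node] >= threshold:
                hhhs[node] = uncovered[node]  # report; do not pass the count up
            else:
                uncovered[parent(node)] += uncovered[node]
            del uncovered[node]
    if uncovered[""] >= threshold:            # the root catches the remainder
        hhhs["*"] = uncovered[""]
    return hhhs

if __name__ == "__main__":
    traffic = (["10.0.0.1"] * 30 + ["10.0.0.2"] * 25 + ["10.0.1.9"] * 20 +
               ["192.168.1.5"] * 5 + ["172.16.0.%d" % i for i in range(20)])
    print(hierarchical_heavy_hitters(traffic, phi=0.1))
```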

Cost-based labeling of groups of mass spectra
Lei Chen, Zheng Huang, Raghu Ramakrishnan
Pages: 167-178
DOI: 10.1145/1007568.1007589

We make two main contributions in this paper. First, we motivate and introduce a novel class of data mining problems that arise in labeling a group of mass spectra, specifically for analysis of atmospheric aerosols, but with natural applications to market-basket datasets. This builds upon other recent work in which we introduced the problem of labeling a single spectrum, and is motivated by the advent of a new generation of Aerosol Time-of-Flight Spectrometers, which are capable of generating mass spectra for hundreds of aerosol particles per minute. We also describe two algorithms for group labeling, which differ greatly in how they utilize a linear programming (LP) solver, and also differ greatly from algorithms for labeling a single spectrum. Our second contribution is to show how to automatically select between these two algorithms in a cost-based manner, analogous to how a query optimizer selects from a space of query plans. While the details are specific to the labeling problem, we believe that this is a promising first step towards a general framework for cost-based data mining, and opens up an important direction for future research.

SESSION: Research sessions: non-standard query processing

Optimization of query streams using semantic prefetching
Ivan T. Bowman, Kenneth Salem
Pages: 179-190
DOI: 10.1145/1007568.1007591

Streams of relational queries submitted by client applications to database servers contain patterns that can be used to predict future requests. We present the Scalpel system, which detects these patterns and optimizes request streams using context-based predictions of future requests. Scalpel uses its predictions to provide a form of semantic prefetching, which involves combining a predicted series of requests into a single request that can be issued immediately. Scalpel's semantic prefetching reduces not only the latency experienced by the application but also the total cost of query evaluation. We describe how Scalpel learns to predict optimizable request patterns by observing the application's request stream during a training phase. We also describe the types of query pattern rewrites that Scalpel's cost-based optimizer considers. Finally, we present empirical results that show the costs and benefits of Scalpel's optimizations.

Buffering database operations for enhanced instruction cache performance
Jingren Zhou, Kenneth A. Ross
Pages: 191-202
DOI: 10.1145/1007568.1007592

As more and more query processing work can be done in main memory, memory access is becoming a significant cost component of database operations. Recent database research has shown that most of the memory stalls are due to second-level cache data misses and first-level instruction cache misses. While a lot of research has focused on reducing the data cache misses, relatively little research has been done on improving the instruction cache performance of database systems. We first answer the question "Why does a database system incur so many instruction cache misses?" We demonstrate that current demand-pull pipelined query execution engines suffer from significant instruction cache thrashing between different operators. We propose techniques to buffer database operations during query execution to avoid instruction cache thrashing. We implement a new light-weight "buffer" operator and study various factors which may affect the cache performance. We also introduce a plan refinement algorithm that considers the query plan and decides whether it is beneficial to add additional "buffer" operators and where to put them. The benefit is mainly from better instruction locality and better hardware branch prediction. Our techniques can be easily integrated into current database systems without significant changes. Our experiments in a memory-resident PostgreSQL database system show that buffering techniques can reduce the number of instruction cache misses by up to 80% and improve query performance by up to 15%.
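
The buffering idea can be sketched with plain demand-pull iterators: a pass-through buffer operator drains a batch of tuples from its child before yielding any of them, so each operator processes a long run of tuples while its code stays hot in the instruction cache. The operator classes and batch size below are illustrative, not the paper's engine integration.

```python
# Minimal sketch of a "buffer" operator inserted into an iterator-style plan.

class Scan:
    def __init__(self, rows):
        self.rows = iter(rows)
    def next(self):
        return next(self.rows, None)

class Select:
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def next(self):
        while True:
            row = self.child.next()
            if row is None or self.pred(row):
                return row

class Buffer:
    """Demand-pull pass-through that batches calls into its child."""
    def __init__(self, child, batch_size=1000):
        self.child, self.batch_size, self.buf = child, batch_size, []
    def next(self):
        if not self.buf:
            # Run the child in a tight loop: one "phase" per operator,
            # instead of ping-ponging between operators tuple by tuple.
            for _ in range(self.batch_size):
                row = self.child.next()
                if row is None:
                    break
                self.buf.append(row)
            self.buf.reverse()          # pop() below returns arrival order
        return self.buf.pop() if self.buf else None

if __name__ == "__main__":
    plan = Buffer(Select(Scan(range(10_000)), lambda r: r % 7 == 0), 512)
    out = []
    while (row := plan.next()) is not None:
        out.append(row)
    print(len(out), "rows, first few:", out[:5])
```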

Rank-aware query optimization
Ihab F. Ilyas, Rahul Shah, Walid G. Aref, Jeffrey Scott Vitter, Ahmed K. Elmagarmid
Pages: 203-214
DOI: 10.1145/1007568.1007593

Ranking is an important property that needs to be fully supported by current relational query engines. Recently, several rank-join query operators have been proposed based on rank aggregation algorithms. Rank-join operators progressively rank the join results while performing the join operation. The new operators have a direct impact on traditional query processing and optimization. We introduce a rank-aware query optimization framework that fully integrates rank-join operators into relational query engines. The framework is based on extending the System R dynamic programming algorithm in both enumeration and pruning. We define ranking as an interesting property that triggers the generation of rank-aware query plans. Unlike traditional join operators, optimizing for rank-join operators depends on estimating the input cardinality of these operators. We introduce a probabilistic model for estimating the input cardinality, and hence the cost of a rank-join operator. To our knowledge, this paper is the first effort in estimating the needed input size for optimal rank aggregation algorithms. Costing ranking plans, although challenging, is key to the full integration of rank-join operators in real-world query processing engines. We experimentally evaluate our framework by modifying the query optimizer of an open-source database management system. The experiments show the validity of our framework and the accuracy of the proposed estimation model.

Fast computation of database operations using graphics processors
Naga K. Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin, Dinesh Manocha
Pages: 215-226
DOI: 10.1145/1007568.1007594

We present new algorithms for performing fast computation of several common database operations on commodity graphics processors. Specifically, we consider operations such as conjunctive selections, aggregations, and semi-linear queries, which are essential computational components of typical database, data warehousing, and data mining applications. While graphics processing units (GPUs) have been designed for fast display of geometric primitives, we utilize the inherent pipelining and parallelism, single instruction and multiple data (SIMD) capabilities, and vector processing functionality of GPUs, for evaluating boolean predicate combinations and semi-linear queries on attributes and executing database operations efficiently. Our algorithms take into account some of the limitations of the programming model of current GPUs and perform no data rearrangements. Our algorithms have been implemented on a programmable GPU (e.g. NVIDIA's GeForce FX 5900) and applied to databases consisting of up to a million records. We have compared their performance with an optimized implementation of CPU-based algorithms. Our experiments indicate that the graphics processor available on commodity computer systems is an effective co-processor for performing database operations.

SESSION: Research sessions: new styles of XML

Lazy query evaluation for Active XML
Serge Abiteboul, Omar Benjelloun, Bogdan Cautis, Ioana Manolescu, Tova Milo, Nicoleta Preda
Pages: 227-238
DOI: 10.1145/1007568.1007596

In this paper, we study query evaluation on Active XML documents (AXML for short), a new generation of XML documents that has recently gained popularity. AXML documents are XML documents whose content is given partly extensionally, by explicit data elements, and partly intensionally, by embedded calls to Web services, which can be invoked to generate data. A major challenge in the efficient evaluation of queries over such documents is to detect which calls may bring data that is relevant for the query execution, and to avoid the materialization of irrelevant information. The problem is intricate, as service calls may be embedded anywhere in the document, and service invocations possibly return data containing calls to new services. Hence, the detection of relevant calls becomes a continuous process. Also, a good analysis must take the service signatures into consideration. We formalize the problem, and provide algorithms to solve it. We also present an implementation that is compliant with XML and Web services standards, and is used as part of the ActiveXML system. Finally, we experimentally measure the performance gains obtained by a careful filtering of the service calls to be triggered.

Data stream management for historical XML data
Sujoe Bose, Leonidas Fegaras
Pages: 239-250
DOI: 10.1145/1007568.1007597

We are presenting a framework for continuous querying of time-varying streamed XML data. A continuous stream in our framework consists of a finite XML document followed by a continuous stream of updates. The unit of update is an XML fragment, which can relate to other fragments through system-generated unique IDs. The reconstruction of temporal data from continuous updates at a current time is never materialized and historical queries operate directly on the fragmented streams. We are incorporating temporal constructs into XQuery with minimal changes to the existing language structure to support continuous querying of time-varying streams of XML data. Our extensions use time projections to capture time-sliding windows, version control for tuple-based windows, and coincidence queries to synchronize events between streams. These XQuery extensions are compiled away to standard XQuery code and the resulting queries operate continuously over the existing fragmented streams.

Colorful XML: one hierarchy isn't enough
H. V. Jagadish, Laks V. S. Lakshmanan, Monica Scannapieco, Divesh Srivastava, Nuwee Wiwatwattana
Pages: 251-262
DOI: 10.1145/1007568.1007598

XML has a tree-structured data model, which is used to uniformly represent structured as well as semi-structured data, and also enable concise query specification in XQuery, via the use of its XPath (twig) patterns. This in turn can leverage the recently developed technology of structural join algorithms to evaluate the query efficiently. In this paper, we identify a fundamental tension in XML data modeling: (i) data represented as deep trees (which can make effective use of twig patterns) are often un-normalized, leading to update anomalies, while (ii) normalized data tends to be shallow, resulting in heavy use of expensive value-based joins in queries. Our solution to this data modeling problem is a novel multi-colored trees (MCT) logical data model, which is an evolutionary extension of the XML data model, and permits trees with multi-colored nodes to signify their participation in multiple hierarchies. This adds significant semantic structure to individual data nodes. We extend XQuery expressions to navigate between structurally related nodes, taking color into account, and also to create new colored trees as restructurings of an MCT database. While MCT serves as a significant evolutionary extension to XML as a logical data model, one of the key roles of XML is for information exchange. To enable exchange of MCT information, we develop algorithms for optimally serializing an MCT database as XML. We discuss alternative physical representations for MCT databases, using relational and native XML databases, and describe an implementation on top of the Timber native XML database. Experimental evaluation, using our prototype implementation, shows that not only are MCT queries/updates more succinct and easier to express than equivalent shallow tree XML queries, but they can also be significantly more efficient to evaluate than equivalent deep and shallow tree XML queries/updates.

Approximate XML query answers
Neoklis Polyzotis, Minos Garofalakis, Yannis Ioannidis
Pages: 263-274
DOI: 10.1145/1007568.1007599

The rapid adoption of XML as the standard for data representation and exchange foreshadows a massive increase in the amounts of XML data collected, maintained, and queried over the Internet or in large corporate data-stores. Inevitably, this will result in the development of on-line decision support systems, where users and analysts interactively explore large XML data sets through a declarative query interface (e.g., XQuery or XSLT). Given the importance of remaining interactive, such on-line systems can employ approximate query answers as an effective mechanism for reducing response time and providing users with early feedback. This approach has been successfully used in relational systems and it becomes even more compelling in the XML world, where the evaluation of complex queries over massive tree-structured data is inherently more expensive. In this paper, we initiate a study of approximate query answering techniques for large XML databases. Our approach is based on a novel, conceptually simple, yet very effective XML-summarization mechanism: TREESKETCH synopses. We demonstrate that, unlike earlier techniques focusing solely on selectivity estimation, our TREESKETCH synopses are much more effective in capturing the complete tree structure of the underlying XML database. We propose novel construction algorithms for building TREESKETCH summaries of limited size, and describe schemes for processing general XML twig queries over a concise TREESKETCH in order to produce very fast, approximate tree-structured query answers. To quantify the quality of such approximate answers, we propose a novel, intuitive error metric that captures the quality of the approximation in terms of both the overall structure of the XML tree and the distribution of document edges. Experimental results on real-life and synthetic data sets verify the effectiveness of our TREESKETCH synopses in producing fast, accurate approximate answers and demonstrate their benefits over previously proposed techniques that focus solely on selectivity estimation. In particular, TREESKETCHes yield faster, more accurate approximate answers and selectivity estimates, and are more efficient to construct. To the best of our knowledge, ours is the first work to address the timely problem of producing fast, approximate tree-structured answers for complex XML queries.

SESSION: Research sessions: statistics

A bi-level Bernoulli scheme for database sampling
Peter J. Haas, Christian König
Pages: 275-286
DOI: 10.1145/1007568.1007601

Current database sampling methods give the user insufficient control when processing ISO-style sampling queries. To address this problem, we provide a bi-level Bernoulli sampling scheme that combines the row-level and page-level sampling methods currently used in most commercial systems. By adjusting the parameters of the method, the user can systematically trade off processing speed and statistical precision - the appropriate choice of parameter settings becomes a query optimization problem. We indicate the SQL extensions needed to support bi-level sampling and determine the optimal parameter settings for an important class of sampling queries with explicit time or accuracy constraints. As might be expected, row-level sampling is preferable when data values on each page are homogeneous, whereas page-level sampling should be used when data values on a page vary widely. Perhaps surprisingly, we show that in many cases the optimal sampling policy is of the "bang-bang" type: we identify a "page-heterogeneity index" (PHI) such that optimal sampling is as "row-like" as possible if the PHI is less than 1 and as "page-like" as possible otherwise. The PHI depends upon both the query and the data, and can be estimated by means of a pilot sample. Because pilot sampling can be nontrivial to implement in commercial database systems, we also give a heuristic method for setting the sampling parameters; the method avoids pilot sampling by using a small number of summary statistics that are maintained in the system catalog. Results from over 1100 experiments on 372 real and synthetic data sets show that the heuristic method performs optimally about half of the time, and yields sampling errors within a factor of 2.2 of optimal about 93% of the time. The heuristic method is stable over a wide range of sampling rates and performs best in the most critical cases, where the data is highly clustered or skewed.
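
The sampling mechanism itself is easy to state: accept each page with probability p, then keep each row on an accepted page with probability q, for an overall row-sampling rate of p*q. The sketch below shows only that mechanism and the speed/precision trade-off it exposes; choosing p and q via the page-heterogeneity index is the paper's contribution and is not shown. The page layout is illustrative.

```python
# Minimal sketch of bi-level Bernoulli sampling: a page-level coin flip followed
# by row-level coin flips, so rejected pages are never read at all.

import random

def bi_level_bernoulli(pages, p, q, rng=None):
    """pages: iterable of row lists. Returns (sampled_rows, pages_touched)."""
    rng = rng or random.Random(42)
    sample, touched = [], 0
    for page in pages:
        if rng.random() < p:          # page-level coin flip
            touched += 1
            for row in page:
                if rng.random() < q:  # row-level coin flip
                    sample.append(row)
    return sample, touched

if __name__ == "__main__":
    # 1,000 pages of 100 rows each; every setting below has overall rate p*q = 0.01.
    table = [[(pid, rid) for rid in range(100)] for pid in range(1000)]
    for p, q in [(0.01, 1.0),   # pure page-level: few pages read, clustered sample
                 (0.1, 0.1),    # bi-level trade-off
                 (1.0, 0.01)]:  # pure row-level: best precision, reads every page
        rows, pages = bi_level_bernoulli(table, p, q, random.Random(1))
        print(f"p={p:<5} q={q:<5} rows={len(rows):>5} pages read={pages}")
```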

Effective use of block-level sampling in statistics estimation
Surajit Chaudhuri, Gautam Das, Utkarsh Srivastava
Pages: 287-298
DOI: 10.1145/1007568.1007602

Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. In this paper, we develop principled approaches to overcome this limitation of block-level sampling for histograms as well as distinct-value estimations. For histogram construction, we give a novel two-phase adaptive method in which the sample size required to reach a desired accuracy is decided based on a first phase sample. This method is significantly faster than previous iterative methods proposed for the same problem. For distinct-value estimation, we show that existing estimators designed for uniform-random samples may perform very poorly if used directly on block-level samples. We present a key technique that computes an appropriate subset of a block-level sample that is suitable for use with most existing estimators. This, to the best of our knowledge, is the first principled method for distinct-value estimation with block-level samples. We provide extensive experimental results validating our methods.

Online maintenance of very large random samples
Christopher Jermaine, Abhijit Pol, Subramanian Arumugam
Pages: 299-310
DOI: 10.1145/1007568.1007603

Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a "sample" is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. Our algorithms are also suitable for biased or unequal probability sampling.
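
For contrast with the disk-based algorithms the paper proposes, here is the classic memory-resident reservoir algorithm that maintains a fixed-size uniform sample without replacement in a single pass. The paper's point is precisely that this approach does not carry over to samples of gigabytes or terabytes that must themselves live on disk; its on-disk algorithms are not reproduced here.

```python
# Minimal sketch of classic reservoir sampling (the in-memory baseline).

import random

def reservoir_sample(stream, k, rng=None):
    """One-pass uniform sample of k items from a stream of unknown length."""
    rng = rng or random.Random(7)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)      # fill phase
        else:
            j = rng.randrange(n)        # keep the new item with probability k/n
            if j < k:
                reservoir[j] = item     # evict a uniformly chosen victim
    return reservoir

if __name__ == "__main__":
    print(sorted(reservoir_sample(range(1_000_000), k=10)))
```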

Conditional selectivity for statistics on query expressions
Nicolas Bruno, Surajit Chaudhuri
Pages: 311-322
DOI: 10.1145/1007568.1007604

Cardinality estimation during query optimization relies on simplifying assumptions that usually do not hold in practice. To diminish the impact of inaccurate estimates during optimization, statistics on query expressions (SITs) have been previously proposed. These statistics help directly model the distribution of tuples on query sub-plans. Past work in statistics on query expressions has exploited view matching technology to harness their benefits. In this paper we argue against such an approach as it overlooks significant opportunities for improvement in cardinality estimations. We then introduce a framework to reason with SITs based on the notion of conditional selectivity. We present a dynamic programming algorithm to efficiently find the most accurate selectivity estimation for given queries, and discuss how such an approach can be incorporated into existing optimizers with a small number of changes. Finally, we demonstrate experimentally that our technique results in superior cardinality estimates compared to previous approaches, with very little overhead.

SESSION: Research sessions: indexing and tuning

Transaction support for indexed summary views
Goetz Graefe, Michael Zwilling
Pages: 323-334
DOI: 10.1145/1007568.1007606

Materialized views have become a standard technique for performance improvement in decision support databases and for a variety of monitoring purposes. In order to avoid inconsistencies and thus unpredictable query results, materialized views and their indexes should be maintained immediately within user transactions just like indexes on ordinary tables. Unfortunately, the smaller a materialized view is, the higher the concurrency contention between queries and updates as well as among concurrent updates. Therefore, we have investigated methods that reduce contention without forcing users to sacrifice serializability and thus predictable application semantics. These methods extend escrow locking with multi-granularity (hierarchical) locking, snapshot transactions, multi-version concurrency control, key range locking, and system transactions, i.e., multiple proven database implementation techniques. The complete design eliminates all contention between pure read transactions and pure update transactions as well as contention among pure update transactions; it enables maximal concurrency of mixed read-write transactions with other transactions; it supports bulk operations such as data import and online index creation; and it provides recovery for transaction, media, and system failures.

Graph indexing: a frequent structure-based approach
Xifeng Yan, Philip S. Yu, Jiawei Han
Pages: 335-346
DOI: 10.1145/1007568.1007607

Graph has become increasingly important in modelling complicated structures and schemaless data such as proteins, chemical compounds, and XML documents. Given a graph query, it is desirable to retrieve graphs quickly from a large database via graph-based indices. In this paper, we investigate the issues of indexing graphs and propose a novel solution by applying a graph mining technique. Different from the existing path-based methods, our approach, called gIndex, makes use of frequent substructure as the basic indexing feature. Frequent substructures are ideal candidates since they explore the intrinsic characteristics of the data and are relatively stable to database updates. To reduce the size of index structure, two techniques, size-increasing support constraint and discriminative fragments, are introduced. Our performance study shows that gIndex has 10 times smaller index size, but achieves 3-10 times better performance in comparison with a typical path-based method, GraphGrep. The gIndex approach not only provides an elegant solution to the graph indexing problem, but also demonstrates how database indexing and query processing can benefit from data mining, especially frequent pattern mining. Furthermore, the concepts developed here can be applied to indexing sequences, trees, and other complicated structures as well.

The Priority R-tree: a practically efficient and worst-case optimal R-tree
Lars Arge, Mark de Berg, Herman J. Haverkort, Ke Yi
Pages: 347-358
DOI: 10.1145/1007568.1007608

We present the Priority R-tree, or PR-tree, which is the first R-tree variant that always answers a window query using O((N/B)^(1-1/d) + T/B) I/Os, where N is the number of d-dimensional (hyper-) rectangles stored in the R-tree, B is the disk block size, and T is the output size. This is provably asymptotically optimal and significantly better than other R-tree variants, where a query may visit all N/B leaves in the tree even when T = 0. We also present an extensive experimental study of the practical performance of the PR-tree using both real-life and synthetic data. This study shows that the PR-tree performs similar to the best known R-tree variants on real-life and relatively nicely distributed data, but outperforms them significantly on more extreme data.
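
The query the PR-tree optimizes is the standard R-tree window query: recurse into a child only when its bounding rectangle intersects the query window. The sketch below shows just that traversal on a hand-built toy tree; the PR-tree construction that guarantees the stated I/O bound is not implemented, and the node layout and rectangles are illustrative.

```python
# Minimal sketch of an R-tree window query over a hand-built two-level tree.

def intersects(a, b):
    """Axis-aligned rectangles as (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class Node:
    def __init__(self, mbr, children=None, entries=None):
        self.mbr = mbr                      # minimum bounding rectangle
        self.children = children or []      # internal node
        self.entries = entries or []        # leaf: the data rectangles

def window_query(node, window, visited=None):
    visited = visited if visited is not None else []
    visited.append(node)                    # each visit would be one node I/O
    if node.entries:                        # leaf
        return [r for r in node.entries if intersects(r, window)], visited
    hits = []
    for child in node.children:
        if intersects(child.mbr, window):   # prune disjoint subtrees
            h, _ = window_query(child, window, visited)
            hits += h
    return hits, visited

if __name__ == "__main__":
    leaf1 = Node((0, 0, 5, 5), entries=[(1, 1, 2, 2), (3, 3, 4, 4)])
    leaf2 = Node((6, 6, 10, 10), entries=[(7, 7, 8, 8)])
    root = Node((0, 0, 10, 10), children=[leaf1, leaf2])
    hits, visited = window_query(root, (2, 2, 7, 7))
    print("hits:", hits, "| nodes visited:", len(visited))
```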

Integrating vertical and horizontal partitioning into automated physical database design
Sanjay Agrawal, Vivek Narasayya, Beverly Yang
Pages: 359-370
DOI: 10.1145/1007568.1007609

In addition to indexes and materialized views, horizontal and vertical partitioning are important aspects of physical design in a relational database system that significantly impact performance. Horizontal partitioning also provides manageability; database administrators often require indexes and their underlying tables partitioned identically so as to make common operations such as backup/restore easier. While partitioning is important, incorporating partitioning makes the problem of automating physical design much harder since: (a) The choices of partitioning can strongly interact with choices of indexes and materialized views. (b) A large new space of physical design alternatives must be considered. (c) Manageability requirements impose a new constraint on the problem. In this paper, we present novel techniques for designing a scalable solution to this integrated physical design problem that takes both performance and manageability into account. We have implemented our techniques and evaluated them on Microsoft SQL Server. Our experiments highlight: (a) the importance of taking an integrated approach to automated physical design and (b) the scalability of our techniques. expand
|
|
|
SESSION: Research sessions: data integration |
|
|
|
|
Constraint-based XML query rewriting for data integration |
| |
Cong Yu,
Lucian Popa
|
|
Pages: 371-382 |
|
doi>10.1145/1007568.1007611 |
|
Full text: PDF
|
|
We study the problem of answering queries through a target schema, given a set of mappings between one or more source schemas and this target schema, and given that the data is at the sources. The schemas can be any combination of relational or XML schemas, ...
We study the problem of answering queries through a target schema, given a set of mappings between one or more source schemas and this target schema, and given that the data is at the sources. The schemas can be any combination of relational or XML schemas, and can be independently designed. In addition to the source-to-target mappings, we consider as part of the mapping scenario a set of target constraints specifying additional properties on the target schema. This becomes particularly important when integrating data from multiple data sources with overlapping data and when such constraints can express data merging rules at the target. We define the semantics of query answering in such an integration scenario, and design two novel algorithms, basic query rewrite and query resolution, to implement the semantics. The basic query rewrite algorithm reformulates target queries in terms of the source schemas, based on the mappings. The query resolution algorithm generates additional rewritings that merge related information from multiple sources and assemble a coherent view of the data, by incorporating target constraints. The algorithms are implemented and then evaluated using a comprehensive set of experiments based on both synthetic and real-life data integration scenarios. expand
|
|
|
iMAP: discovering complex semantic matches between database schemas |
| |
Robin Dhamankar,
Yoonkyong Lee,
AnHai Doan,
Alon Halevy,
Pedro Domingos
|
|
Pages: 383-394 |
|
doi>10.1145/1007568.1007612 |
|
Full text: PDF
|
|
Creating semantic matches between disparate data sources is fundamental to numerous data sharing efforts. Manually creating matches is extremely tedious and error-prone. Hence many recent works have focused on automating the matching process. To date, ...
Creating semantic matches between disparate data sources is fundamental to numerous data sharing efforts. Manually creating matches is extremely tedious and error-prone. Hence many recent works have focused on automating the matching process. To date, however, virtually all of these works deal only with one-to-one (1-1) matches, such as address = location. They do not consider the important class of more complex matches, such as address = concat(city, state) and room-price = room-rate * (1 + tax-rate). We describe the iMAP system which semi-automatically discovers both 1-1 and complex matches. iMAP reformulates schema matching as a search in an often very large or infinite match space. To search effectively, it employs a set of searchers, each discovering specific types of complex matches. To further improve matching accuracy, iMAP exploits a variety of domain knowledge, including past complex matches, domain integrity constraints, and overlap data. Finally, iMAP introduces a novel feature that generates explanations of predicted matches, to provide insights into the matching process and suggest actions to converge on correct matches quickly. We apply iMAP to several real-world domains to match relational tables, and show that it discovers both 1-1 and complex matches with high accuracy. expand
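The following hypothetical sketch shows the flavor of one searcher, a concatenation searcher that scores a candidate formula such as concat(city, state) by how often it reproduces sampled target values. The function name, separator, and sample data are illustrative assumptions; iMAP's actual searchers, search strategy, and use of domain knowledge are considerably richer.

# Hypothetical "concatenation searcher" score: how often does
# concat(city, state) over source samples reproduce a target value?
def score_concat(source_rows, cols, target_values, sep=", "):
    produced = {sep.join(str(row[c]) for c in cols) for row in source_rows}
    hits = sum(1 for v in target_values if v in produced)
    return hits / max(len(target_values), 1)

source = [{"city": "Seattle", "state": "WA"}, {"city": "Madison", "state": "WI"}]
target = ["Seattle, WA", "Portland, OR"]
print(score_concat(source, ["city", "state"], target))   # 0.5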
|
|
|
Adapting to source properties in processing data integration queries |
| |
Zachary G. Ives,
Alon Y. Halevy,
Daniel S. Weld
|
|
Pages: 395-406 |
|
doi>10.1145/1007568.1007613 |
|
Full text: PDF
|
|
An effective query optimizer finds a query plan that exploits the characteristics of the source data. In data integration, little is known in advance about sources' properties, which necessitates the use of adaptive query processing techniques ...
An effective query optimizer finds a query plan that exploits the characteristics of the source data. In data integration, little is known in advance about sources' properties, which necessitates the use of adaptive query processing techniques to adjust query processing on-the-fly. Prior work in adaptive query processing has focused on compensating for delays and adjusting for mis-estimated cardinality or selectivity values. In this paper, we present a generalized architecture for adaptive query processing and introduce a new technique, called adaptive data partitioning (ADP), which is based on the idea of dividing the source data into regions, each executed by different, complementary plans. We show how this model can be applied in novel ways to not only correct for underestimated selectivity and cardinality values, but also to discover and exploit order in the source data, and to detect and exploit source data that can be effectively pre-aggregated. We experimentally compare a number of alternative strategies and show that our approach is effective. expand
|
|
|
SESSION: Research sessions: stream QP |
|
|
|
|
Adaptive ordering of pipelined stream filters |
| |
Shivnath Babu,
Rajeev Motwani,
Kamesh Munagala,
Itaru Nishizawa,
Jennifer Widom
|
|
Pages: 407-418 |
|
doi>10.1145/1007568.1007615 |
|
Full text: PDF
|
|
We consider the problem of pipelined filters, where a continuous stream of tuples is processed by a set of commutative filters. Pipelined filters are common in stream applications and capture a large class of multiway stream joins. We focus on ...
We consider the problem of pipelined filters, where a continuous stream of tuples is processed by a set of commutative filters. Pipelined filters are common in stream applications and capture a large class of multiway stream joins. We focus on the problem of ordering the filters adaptively to minimize processing cost in an environment where stream and filter characteristics vary unpredictably over time. Our core algorithm, A-Greedy (for Adaptive Greedy), has strong theoretical guarantees: If stream and filter characteristics were to stabilize, A-Greedy would converge to an ordering within a small constant factor of optimal. (In experiments A-Greedy usually converges to the optimal ordering.) One very important feature of A-Greedy is that it monitors and responds to selectivities that are correlated across filters (i.e., that are nonindependent), which provides the strong quality guarantee but incurs run-time overhead. We identify a three-way tradeoff among provable convergence to good orderings, run-time overhead, and speed of adaptivity. We develop a suite of variants of A-Greedy that lie at different points on this tradeoff spectrum. We have implemented all our algorithms in the STREAM prototype Data Stream Management System and present a thorough performance evaluation. expand
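A static, simplified version of the underlying ordering principle is sketched below: under an independence assumption, expected cost per tuple is minimized by sorting filters in increasing cost / (1 - selectivity). A-Greedy itself maintains conditional selectivities from a run-time profile and reorders as they drift; that adaptive machinery, and the correlation handling described above, are not shown.

# Simplified, static filter ordering under an independence assumption;
# selectivity is the fraction of tuples a filter passes. Not the adaptive
# A-Greedy algorithm itself.
def order_filters(filters):
    """filters: list of (name, cost, selectivity) with selectivity in [0, 1)."""
    return sorted(filters, key=lambda f: f[1] / (1.0 - f[2]))

def expected_cost(ordered):
    """Expected work per input tuple, assuming independent filters."""
    cost, survive = 0.0, 1.0
    for _, c, sel in ordered:
        cost += survive * c
        survive *= sel
    return cost

fs = [("F1", 1.0, 0.9), ("F2", 2.0, 0.1), ("F3", 0.5, 0.5)]
plan = order_filters(fs)
print([name for name, _, _ in plan], round(expected_cost(plan), 3))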
|
|
|
Static optimization of conjunctive queries with sliding windows over infinite streams |
| |
Ahmed M. Ayad,
Jeffrey F. Naughton
|
|
Pages: 419-430 |
|
doi>10.1145/1007568.1007616 |
|
Full text: PDF
|
|
We define a framework for static optimization of sliding window conjunctive queries over infinite streams. When computational resources are sufficient, we propose that the goal of optimization should be to find an execution plan that minimizes resource ...
We define a framework for static optimization of sliding window conjunctive queries over infinite streams. When computational resources are sufficient, we propose that the goal of optimization should be to find an execution plan that minimizes resource usage within the available resource constraints. When resources are insufficient, on the other hand, we propose that the goal should be to find an execution plan that sheds some of the input load (by randomly dropping tuples) to keep resource usage within bounds while maximizing the output rate. An intuitive approach to load shedding suggests starting with the plan that would be optimal if resources were sufficient and adding "drop boxes" to this plan. We find this to be often suboptimal - in many instances the optimal partial answer plan results from adding drop boxes to plans that are not optimal in the unlimited resource case. In view of this, we use our framework to investigate an approach to optimization that unifies the placement of drop boxes and the choice of the query plan from which to drop tuples. The effectiveness of our optimizer is experimentally validated and the results show the promise of this approach. expand
|
|
|
Dynamic plan migration for continuous queries over data streams |
| |
Yali Zhu,
Elke A. Rundensteiner,
George T. Heineman
|
|
Pages: 431-442 |
|
doi>10.1145/1007568.1007617 |
|
Full text: PDF
|
|
Dynamic plan migration is concerned with the on-the-fly transition from one continuous query plan to a semantically equivalent yet more efficient plan. Migration is important for stream monitoring systems where long-running queries may have to withstand ...
Dynamic plan migration is concerned with the on-the-fly transition from one continuous query plan to a semantically equivalent yet more efficient plan. Migration is important for stream monitoring systems where long-running queries may have to withstand fluctuations in stream workloads and data characteristics. Existing migration methods generally adopt a pause-drain-resume strategy that pauses the processing of new data and purges all old data in the existing plan before the new plan can be plugged into the system. However, these existing strategies do not address the problem of migrating query plans that contain stateful operators, such as joins. We develop solutions for online plan migration for continuous stateful plans. In particular, in this paper, we propose two alternative strategies, called the moving state strategy and the parallel track strategy, one exploiting reusability and the other exploiting parallelism, to seamlessly migrate between continuous join plans without affecting the results of the query. We develop cost models for both migration strategies to analytically compare them. We embed these migration strategies into CAPE [7], a prototype stream query engine, and conduct a comparative experimental study to evaluate these two strategies for window-based join plans. Our experimental results illustrate that the two strategies can vary significantly in terms of output rates and intermediate storage spaces given distinct system configurations and stream workloads. expand
|
|
|
SESSION: Research sessions: clustering |
|
|
|
|
Clustering objects on a spatial network |
| |
Man Lung Yiu,
Nikos Mamoulis
|
|
Pages: 443-454 |
|
doi>10.1145/1007568.1007619 |
|
Full text: PDF
|
|
Clustering is one of the most important analysis tasks in spatial databases. We study the problem of clustering objects, which lie on edges of a large weighted spatial network. The distance between two objects is defined by their shortest path distance ...
Clustering is one of the most important analysis tasks in spatial databases. We study the problem of clustering objects that lie on the edges of a large weighted spatial network. The distance between two objects is defined by their shortest path distance over the network. Past algorithms are based on the Euclidean distance and cannot be applied in this setting. We propose variants of partitioning, density-based, and hierarchical methods. Their effectiveness and efficiency are evaluated for collections of objects which appear on real road networks. The results show that our methods can correctly identify clusters and they are scalable for large problems. expand
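The primitive these methods rest on is network (shortest-path) distance rather than Euclidean distance. The sketch below computes it with Dijkstra's algorithm and uses it to assign objects to the nearest of a set of candidate medoids; the graph, medoids, and function names are illustrative assumptions, and the paper's partitioning, density-based, and hierarchical variants are built on top of such a primitive.

# Shortest-path distance on a weighted network, used as the similarity
# measure for assigning objects to medoids. The clustering algorithms
# themselves are not reproduced here.
import heapq

def dijkstra(graph, source):
    """graph: {node: [(neighbor, weight), ...]}; returns distances from source."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def assign_to_medoids(graph, objects, medoids):
    dists = {m: dijkstra(graph, m) for m in medoids}
    return {o: min(medoids, key=lambda m: dists[m].get(o, float("inf")))
            for o in objects}

road = {"a": [("b", 2.0)], "b": [("a", 2.0), ("c", 1.0)], "c": [("b", 1.0)]}
print(assign_to_medoids(road, ["a", "b", "c"], ["a", "c"]))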
|
|
|
Computing Clusters of Correlation Connected objects |
| |
Christian Böhm,
Karin Kailing,
Peer Kröger,
Arthur Zimek
|
|
Pages: 455-466 |
|
doi>10.1145/1007568.1007620 |
|
Full text: PDF
|
|
The detection of correlations between different features in a set of feature vectors is a very important data mining task because correlation indicates a dependency between the features or some association of cause and effect between them. This association ...
The detection of correlations between different features in a set of feature vectors is a very important data mining task because correlation indicates a dependency between the features or some association of cause and effect between them. This association can be arbitrarily complex, i.e. one or more features might be dependent on a combination of several other features. Well-known methods like principal components analysis (PCA) can perfectly find correlations which are global, linear, not hidden in a set of noise vectors, and uniform, i.e. the same type of correlation is exhibited in all feature vectors. In many applications such as medical diagnosis, molecular biology, time sequences, or electronic commerce, however, correlations are not global since the dependency between features can be different in different subgroups of the set. In this paper, we propose a method called 4C (Computing Correlation Connected Clusters) to identify local subgroups of the data objects sharing a uniform but arbitrarily complex correlation. Our algorithm is based on a combination of PCA and density-based clustering (DBSCAN). Our method has a determinate result and is robust against noise. A broad comparative evaluation demonstrates the superior performance of 4C over competing methods such as DBSCAN, CLIQUE and ORCLUS. expand
|
|
|
Incremental and effective data summarization for dynamic hierarchical clustering |
| |
Samer Nassar,
Jörg Sander,
Corrine Cheng
|
|
Pages: 467-478 |
|
doi>10.1145/1007568.1007621 |
|
Full text: PDF
|
|
Mining informative patterns from very large, dynamically changing databases poses numerous interesting challenges. Data summarizations (e.g., data bubbles) have been proposed to compress very large static databases into representative points suitable ...
Mining informative patterns from very large, dynamically changing databases poses numerous interesting challenges. Data summarizations (e.g., data bubbles) have been proposed to compress very large static databases into representative points suitable for subsequent effective hierarchical cluster analysis. In many real world applications, however, the databases dynamically change due to frequent insertions and deletions, possibly changing the data distribution and clustering structure over time. Completely reapplying both the data summarization and the clustering algorithm to detect the changes in the clustering structure and update the uncovered data patterns following such deletions and insertions is prohibitively expensive for large fast changing databases. In this paper, we propose a new scheme to maintain data bubbles incrementally. By using incremental data bubbles, a high-quality hierarchical clustering is quickly available at any point in time. In our scheme, a quality measure for incremental data bubbles is used to identify data bubbles that do not compress well their underlying data points after certain insertions and deletions. Only these data bubbles are re-built using efficient split and merge operations. An extensive experimental evaluation shows that the incremental data bubbles provide significantly faster data summarization than completely re-building the data bubbles after a certain number of insertions and deletions, and are effective in preserving (and in some cases even improving) the quality of the data summarization. expand
|
|
|
SESSION: Research sessions: XML PubSub and indexing |
|
|
|
|
Implementing a scalable XML publish/subscribe system using relational database systems |
| |
Feng Tian,
Berthold Reinwald,
Hamid Pirahesh,
Tobias Mayr,
Jussi Myllymaki
|
|
Pages: 479-490 |
|
doi>10.1145/1007568.1007623 |
|
Full text: PDF
|
|
An XML publish/subscribe system needs to match many XPath queries (subscriptions) over published XML documents. The performance and scalability of the matching algorithm is essential for the system when the number of XPath subscriptions is large. Earlier ...
An XML publish/subscribe system needs to match many XPath queries (subscriptions) over published XML documents. The performance and scalability of the matching algorithm is essential for the system when the number of XPath subscriptions is large. Earlier solutions to this problem usually built large finite state automata for all the XPath subscriptions in memory. The scalability of this approach is limited by the amount of available physical memory. In this paper, we propose an implementation that uses a relational database as the matching engine. The heavy lifting part of evaluating a large number of subscriptions is done inside a relational database using indices and joins. We describe several different implementation strategies and present a performance evaluation. The system shows very good performance and scalability in our experiments, handling millions of subscriptions with a moderate amount of physical memory. expand
|
|
|
Incremental maintenance of XML structural indexes |
| |
Ke Yi,
Hao He,
Ioana Stanoi,
Jun Yang
|
|
Pages: 491-502 |
|
doi>10.1145/1007568.1007624 |
|
Full text: PDF
|
|
Increasing popularity of XML in recent years has generated much interest in query processing over graph-structured data. To support efficient evaluation of path expressions, many structural indexes have been proposed. The most popular ones are the 1-index, ...
Increasing popularity of XML in recent years has generated much interest in query processing over graph-structured data. To support efficient evaluation of path expressions, many structural indexes have been proposed. The most popular ones are the 1-index, based on the notion of graph bisimilarity, and the recently proposed A(k)-index, based on the notion of local similarity to provide a trade-off between index size and query answering power. For these indexes to be practical, we need effective and efficient incremental maintenance algorithms to keep them consistent with the underlying data. However, existing update algorithms for structural indexes essentially provide no guarantees on the quality of the index; the updated index is usually larger than necessary, degrading the performance of subsequent queries. In this paper, we propose update algorithms for the 1-index and the A(k)-index with provable guarantees on the resulting index quality. Our algorithms always maintain a minimal index, i.e., merging any two index nodes would result in an incorrect index. For the 1-index, if the data graph is acyclic, our algorithm further ensures that the index is minimum, i.e., it has the least number of index nodes possible. For the A(k)-index, we show that the minimal index our algorithm maintains is also the unique minimum A(k)-index, for both acyclic and cyclic data graphs. Finally, through experimental evaluation, we demonstrate that our algorithms bring significant improvement over previous methods, in terms of both index size and update time. expand
|
|
|
Incremental evaluation of schema-directed XML publishing |
| |
Philip Bohannon,
Byron Choi,
Wenfei Fan
|
|
Pages: 503-514 |
|
doi>10.1145/1007568.1007625 |
|
Full text: PDF
|
|
When large XML documents published from a database are maintained externally, it is inefficient to repeatedly recompute them when the database is updated. Vastly preferable is incremental update, as common for views stored in a data warehouse. However, ...
When large XML documents published from a database are maintained externally, it is inefficient to repeatedly recompute them when the database is updated. Vastly preferable is incremental update, as common for views stored in a data warehouse. However, to support schema-directed publishing, there may be no simple query that defines the mapping from the database to the external document. To meet the need for efficient incremental update, this paper studies two approaches for incremental evaluation of ATGs [4], a formalism for schema-directed XML publishing. The reduction approach seeks to push as much work as possible to the underlying DBMS. It is based on a relational encoding of XML trees and a nontrivial translation of ATGs to SQL 99 queries with recursion. However, a weakness of this approach is that it relies on high-end DBMS features rather than the lowest common denominator. In contrast, the bud-cut approach pushes only simple queries to the DBMS and performs the bulk of the work in middleware. It capitalizes on the tree-structure of XML views to minimize unnecessary recomputations and leverages optimization techniques developed for XML publishing. While implementation of the reduction approach is not yet within the reach of commercial DBMSs, we have implemented the bud-cut approach and experimentally evaluated its performance compared to recomputation. expand
|
|
|
SESSION: Research sessions: P2P and sensor networks |
|
|
|
|
The price of validity in dynamic networks |
| |
Mayank Bawa,
Aristides Gionis,
Hector Garcia-Molina,
Rajeev Motwani
|
|
Pages: 515-526 |
|
doi>10.1145/1007568.1007627 |
|
Full text: PDF
|
|
Massive-scale self-administered networks like Peer-to-Peer and Sensor Networks have data distributed across thousands of participant hosts. These networks are highly dynamic with short-lived hosts being the norm rather than an exception. In recent years, ...
Massive-scale self-administered networks like Peer-to-Peer and Sensor Networks have data distributed across thousands of participant hosts. These networks are highly dynamic with short-lived hosts being the norm rather than an exception. In recent years, researchers have investigated best-effort algorithms to efficiently process aggregate queries (e.g., sum, count, average, minimum and maximum) [6, 13, 21, 34, 35, 37] on these networks. Unfortunately, query semantics for best-effort algorithms are ill-defined, making it hard to reason about guarantees associated with the result returned. In this paper, we specify a correctness condition, single-site validity, with respect to which the above algorithms are best-effort. We present a class of algorithms that guarantee validity in dynamic networks. Experiments on real-life and synthetic network topologies validate performance of our algorithms, revealing the hitherto unknown price of validity. expand
|
|
|
Compressing historical information in sensor networks |
| |
Antonios Deligiannakis,
Yannis Kotidis,
Nick Roussopoulos
|
|
Pages: 527-538 |
|
doi>10.1145/1007568.1007628 |
|
Full text: PDF
|
|
We are inevitably moving into a realm where small and inexpensive wireless devices would be seamlessly embedded in the physical world and form a wireless sensor network in order to perform complex monitoring and computational tasks. Such networks pose ...
We are inevitably moving into a realm where small and inexpensive wireless devices will be seamlessly embedded in the physical world and form a wireless sensor network in order to perform complex monitoring and computational tasks. Such networks pose new challenges in data processing and dissemination because of the limited resources (processing, bandwidth, energy) that such devices possess. In this paper we propose a new technique for compressing multiple streams containing historical data from each sensor. Our method exploits correlation and redundancy among multiple measurements on the same sensor and achieves a high degree of data reduction while managing to capture even the smallest details of the recorded measurements. The key to our technique is the base signal, a series of values extracted from the real measurements, used for encoding piece-wise linear correlations among the collected data values. We provide efficient algorithms for extracting the base signal features from the data and for encoding the measurements using these features. Our experiments demonstrate that our method by far outperforms standard approximation techniques like Wavelets, Histograms, and the Discrete Cosine Transform, on a variety of error metrics and for real datasets from different domains. expand
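The sketch below illustrates only the piece-wise linear encoding step: a window of readings is approximated as a linear function (scale and shift) of a base-signal segment, so that two numbers per window suffice for transmission. How the base signal is extracted and maintained, which is the core of the method, is not modeled; the data and function names are hypothetical.

# Approximate a window of readings as a * base + b via least squares;
# only (a, b) would need to be sent. Base-signal construction is elided.
def fit_linear(base, window):
    """Least-squares fit of window ~ a * base + b."""
    n = len(base)
    mx, my = sum(base) / n, sum(window) / n
    sxx = sum((x - mx) ** 2 for x in base)
    sxy = sum((x - mx) * (y - my) for x, y in zip(base, window))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx

base = [10.0, 12.0, 15.0, 13.0]
window = [20.4, 24.1, 30.2, 26.0]
a, b = fit_linear(base, window)
print(round(a, 3), round(b, 3))
print([round(a * x + b, 2) for x in base])   # approximate reconstruction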
|
|
|
Efficient query reformulation in peer data management systems |
| |
Igor Tatarinov,
Alon Halevy
|
|
Pages: 539-550 |
|
doi>10.1145/1007568.1007629 |
|
Full text: PDF
|
|
Peer data management systems (PDMS) offer a flexible architecture for decentralized data sharing. In a PDMS, every peer is associated with a schema that represents the peer's domain of interest, and semantic relationships between peers are provided locally ...
Peer data management systems (PDMS) offer a flexible architecture for decentralized data sharing. In a PDMS, every peer is associated with a schema that represents the peer's domain of interest, and semantic relationships between peers are provided locally between pairs (or small sets) of peers. By traversing semantic paths of mappings, a query over one peer can obtain relevant data from any reachable peer in the network. Semantic paths are traversed by reformulating queries at a peer into queries on its neighbors. Naively following semantic paths is highly inefficient in practice. We describe several techniques for optimizing the reformulation process in a PDMS and validate their effectiveness using real-life data sets. In particular, we develop techniques for pruning paths in the reformulation process and for minimizing the reformulated queries as they are created. In addition, we consider the effect of the strategy we use to search through the space of reformulations. Finally, we show that pre-computing semantic paths in a PDMS can greatly improve the efficiency of the reformulation process. Together, all of these techniques form a basis for scalable query reformulation in PDMS. To enable our optimizations, we developed practical algorithms, of independent interest, for checking containment and minimization of XML queries, and for composing XML mappings. expand
|
|
|
SESSION: Research sessions: security and privacy |
|
|
|
|
Extending query rewriting techniques for fine-grained access control |
| |
Shariq Rizvi,
Alberto Mendelzon,
S. Sudarshan,
Prasan Roy
|
|
Pages: 551-562 |
|
doi>10.1145/1007568.1007631 |
|
Full text: PDF
|
|
Current day database applications, with large numbers of users, require fine-grained access control mechanisms, at the level of individual tuples, not just entire relations/views, to control which parts of the data can be accessed by each user. Fine-grained ...
Current day database applications, with large numbers of users, require fine-grained access control mechanisms, at the level of individual tuples, not just entire relations/views, to control which parts of the data can be accessed by each user. Fine-grained access control is often enforced in the application code, which has numerous drawbacks; these can be avoided by specifying/enforcing access control at the database level. We present a novel fine-grained access control model based on authorization views that allows "authorization-transparent" querying; that is, user queries can be phrased in terms of the database relations, and are valid if they can be answered using only the information contained in these authorization views. We extend earlier work on authorization-transparent querying by introducing a new notion of validity, conditional validity. We give a powerful set of inference rules to check for query validity. We demonstrate the practicality of our techniques by describing how an existing query optimizer can be extended to perform access control checks by incorporating these inference rules. expand
|
|
|
Order preserving encryption for numeric data |
| |
Rakesh Agrawal,
Jerry Kiernan,
Ramakrishnan Srikant,
Yirong Xu
|
|
Pages: 563-574 |
|
doi>10.1145/1007568.1007632 |
|
Full text: PDF
|
|
Encryption is a well established technology for protecting sensitive data. However, once encrypted, data can no longer be easily queried aside from exact matches. We present an order-preserving encryption scheme for numeric data that allows any comparison ...
Encryption is a well established technology for protecting sensitive data. However, once encrypted, data can no longer be easily queried aside from exact matches. We present an order-preserving encryption scheme for numeric data that allows any comparison operation to be directly applied on encrypted data. Query results produced are sound (no false hits) and complete (no false drops). Our scheme handles updates gracefully and new values can be added without requiring changes in the encryption of other values. It allows standard database indexes to be built over encrypted tables and can easily be integrated with existing database systems. The proposed scheme has been designed to be deployed in application environments in which the intruder can get access to the encrypted database, but does not have prior domain information such as the distribution of values and cannot encrypt or decrypt arbitrary values of his choice. The encryption is robust against estimation of the true value in such environments. expand
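To see why order preservation makes comparison predicates work directly on ciphertexts, consider the toy monotone mapping below: plaintexts are mapped through a secret, strictly increasing table, so x < y exactly when E(x) < E(y). This is emphatically not the scheme proposed in the paper, which models and transforms value distributions to resist statistical attacks; the toy carries no security claim and only illustrates the query-processing property.

# Toy order-preserving mapping: a secret, strictly increasing table built
# from cumulative random gaps. NOT the paper's OPES construction and not
# secure; it only shows that comparisons carry over to ciphertexts.
import random

def keygen(domain_size, seed=42):
    rng = random.Random(seed)
    table, acc = [], 0
    for _ in range(domain_size):
        acc += rng.randint(1, 100)
        table.append(acc)
    return table

def encrypt(table, x):
    return table[x]

table = keygen(1000)
a, b = encrypt(table, 10), encrypt(table, 200)
print(a < b)                               # True: order is preserved
print(encrypt(table, 150) in range(a, b))  # range predicates work on ciphertexts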
|
|
|
A formal analysis of information disclosure in data exchange |
| |
Gerome Miklau,
Dan Suciu
|
|
Pages: 575-586 |
|
doi>10.1145/1007568.1007633 |
|
Full text: PDF
|
|
We perform a theoretical study of the following query-view security problem: given a view V to be published, does V logically disclose information about a confidential query S? The problem is motivated by the need to manage ...
We perform a theoretical study of the following query-view security problem: given a view V to be published, does V logically disclose information about a confidential query S? The problem is motivated by the need to manage the risk of unintended information disclosure in today's world of universal data exchange. We present a novel information-theoretic standard for query-view security. This criterion can be used to provide a precise analysis of information disclosure for a host of data exchange scenarios, including multi-party collusion and the use of outside knowledge by an adversary trying to learn privileged facts about the database. We prove a number of theoretical results for deciding security according to this standard. We also generalize our security criterion to account for prior knowledge a user or adversary may possess, and introduce techniques for measuring the magnitude of partial disclosures. We believe these results can be a foundation for practical efforts to secure data exchange frameworks, and also illuminate a nice interaction between logic and probability theory. expand
|
|
|
Secure XML querying with security views |
| |
Wenfei Fan,
Chee-Yong Chan,
Minos Garofalakis
|
|
Pages: 587-598 |
|
doi>10.1145/1007568.1007634 |
|
Full text: PDF
|
|
The prevalent use of XML highlights the need for a generic, flexible access-control mechanism for XML documents that supports efficient and secure query access, without revealing sensitive information to unauthorized users. This paper introduces a novel ...
The prevalent use of XML highlights the need for a generic, flexible access-control mechanism for XML documents that supports efficient and secure query access, without revealing sensitive information to unauthorized users. This paper introduces a novel paradigm for specifying XML security constraints and investigates the enforcement of such constraints during XML query evaluation. Our approach is based on the novel concept of security views, which provide for each user group (a) an XML view consisting of all and only the information that the users are authorized to access, and (b) a view DTD that the XML view conforms to. Security views effectively protect sensitive data from access and potential inferences by unauthorized users, and provide authorized users with necessary schema information to facilitate effective query formulation and optimization. We propose an efficient algorithm for deriving security view definitions from security policies (defined on the original document DTD) for different user groups. We also develop novel algorithms for XPath query rewriting and optimization such that queries over security views can be efficiently answered without materializing the views. Our algorithms transform a query over a security view to an equivalent query over the original document, and effectively prune query nodes by exploiting the structural properties of the document DTD in conjunction with approximate XPath containment tests. Our work is the first to study a flexible, DTD-based access-control model for XML and its implications on the XML query-execution engine. Furthermore, it is among the first efforts for query rewriting and optimization in the presence of general DTDs for a rich class of XPath queries. An empirical study based on real-life DTDs verifies the effectiveness of our approach. expand
|
|
|
SESSION: Research sessions: moving objects |
|
|
|
|
Indexing spatio-temporal trajectories with Chebyshev polynomials |
| |
Yuhan Cai,
Raymond Ng
|
|
Pages: 599-610 |
|
doi>10.1145/1007568.1007636 |
|
Full text: PDF
|
|
In this paper, we attempt to approximate and index a d-dimensional (d ≥ 1) spatio-temporal trajectory with a low order continuous polynomial. There are many possible ways to choose the polynomial, including (continuous) Fourier transforms, ...
In this paper, we attempt to approximate and index a d-dimensional (d ≥ 1) spatio-temporal trajectory with a low order continuous polynomial. There are many possible ways to choose the polynomial, including (continuous) Fourier transforms, splines, non-linear regression, etc. Some of these possibilities have indeed been studied before. We hypothesize that one of the best possibilities is the polynomial that minimizes the maximum deviation from the true value, which is called the minimax polynomial. Minimax approximation is particularly meaningful for indexing because in a branch-and-bound search (i.e., for finding nearest neighbours), the smaller the maximum deviation, the more pruning opportunities there exist. However, in general, among all the polynomials of the same degree, the optimal minimax polynomial is very hard to compute. Fortunately, it has been shown that the Chebyshev approximation is almost identical to the optimal minimax polynomial, and is easy to compute [16]. Thus, in this paper, we explore how to use the Chebyshev polynomials as a basis for approximating and indexing d-dimensional trajectories. The key analytic result of this paper is the Lower Bounding Lemma. That is, we show that the Euclidean distance between two d-dimensional trajectories is lower bounded by the weighted Euclidean distance between the two vectors of Chebyshev coefficients. This lemma is not trivial to show, and it ensures that indexing with Chebyshev coefficients admits no false negatives. To complement that analytic result, we conducted a comprehensive experimental evaluation with real and generated 1-dimensional to 4-dimensional data sets. We compared the proposed scheme with the Adaptive Piecewise Constant Approximation (APCA) scheme. Our preliminary results indicate that in all situations we tested, Chebyshev indexing dominates APCA in pruning power, I/O and CPU costs. expand
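A small sketch of the mechanics, assuming 1-dimensional trajectories sampled at uniform timestamps: each trajectory is reduced to its leading Chebyshev coefficients, and distances between coefficient vectors act as a cheap filter before true Euclidean distances are computed. The Lower Bounding Lemma specifies the exact weighting under which the filter distance never exceeds the true distance; the unweighted version below, and the NumPy routine used, are illustrative assumptions rather than the paper's construction.

# Reduce a sampled 1-d trajectory to a few Chebyshev coefficients and use
# coefficient-space distance as a cheap filter. The paper's weighted lower
# bound is not reproduced here.
import numpy as np

def cheb_features(values, degree=4):
    """Fit a Chebyshev polynomial over [-1, 1] and keep its coefficients."""
    t = np.linspace(-1.0, 1.0, len(values))
    return np.polynomial.chebyshev.chebfit(t, values, degree)

def filter_distance(f1, f2):
    return float(np.linalg.norm(f1 - f2))

rng = np.random.default_rng(0)
t = np.linspace(-1.0, 1.0, 64)
traj_a = np.sin(3 * t) + 0.05 * rng.standard_normal(64)
traj_b = np.sin(3 * t + 0.4) + 0.05 * rng.standard_normal(64)

fa, fb = cheb_features(traj_a), cheb_features(traj_b)
print("filter distance:", round(filter_distance(fa, fb), 3))
print("true distance:  ", round(float(np.linalg.norm(traj_a - traj_b)), 3))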
|
|
|
Prediction and indexing of moving objects with unknown motion patterns |
| |
Yufei Tao,
Christos Faloutsos,
Dimitris Papadias,
Bin Liu
|
|
Pages: 611-622 |
|
doi>10.1145/1007568.1007637 |
|
Full text: PDF
|
|
Existing methods for peediction spatio-temporal databases assume that objects move according to linear functions. This severely limits their applicability, since in practice movement is more complex, and individual objects may follow drastically diffferent ...
Existing methods for peediction spatio-temporal databases assume that objects move according to linear functions. This severely limits their applicability, since in practice movement is more complex, and individual objects may follow drastically diffferent motion patterns. In order to overcome these problems, we first introduce a general framework for monitoring and indexing moving objects, where (i) each boject computes individually the function that accurately captures its movement and (ii) a server indexes the object locations at a coarse level and processes queries using a filter-refinement mechanism. Our second contribution is a novel recursive motion function that supports a broad class of non-linear motion patterns. The function does not presume any a-priori movement but can postulate the particular motion of each object by examining its locations at recent timestamps. Finally. we propse an efficient indexing scheme that faciliates the processing of predicitive queries without false misses. expand
|
|
|
SINA: scalable incremental processing of continuous queries in spatio-temporal databases |
| |
Mohamed F. Mokbel,
Xiaopeing Xiong,
Walid G. Aref
|
|
Pages: 623-634 |
|
doi>10.1145/1007568.1007638 |
|
Full text: PDF
|
|
This paper intoduces the Scalable INcremental hash-based Algorithm (SINA, for short); a new algorithm for evaluting a set of concurrent continuous spatio-temporal queries. SINA is designed with two goals in mind: (1) Scalability in terms of the ...
This paper intoduces the Scalable INcremental hash-based Algorithm (SINA, for short); a new algorithm for evaluting a set of concurrent continuous spatio-temporal queries. SINA is designed with two goals in mind: (1) Scalability in terms of the number of concurrent continuous spatio-temporal queries, and (2) Incremental evaluation of continyous spatio-temporal queries. SINA achieves scalability by empolying a shared execution paradigm where the execution of continuous spatio-temporal queries is abstracted as a spatial join between a set of moving objects and a set of moving queries. Incremental evaluation is achived by computing only the updates of the previously reported answer. We introduce two types of updaes, namely positive and negative updates. Positive or negative updates indicate that a certain object should be added to or removed from the previously reported answer, respectively. SINA manages the computation of postive and negative updates via three phases: the hashing phase, the invalidation phase, and the joining phase. the hashing phase employs an in-memory hash-based join algorithm that results in a set a positive upldates. The invalidation phase is triggered every T seconds or when the memory is fully occupied to produce a set of negative updates. Finally, the joining phase is triggered by the end of the invalidation phase to produce a set of both positive and negative updates that result from joining in-memory data with in-disk data. Experimental results show that SINA is scalable and is more efficient than other index-based spatio-temporal algorithms. expand
|
|
|
STRIPES: an efficient index for predicted trajectories |
| |
Jignesh M. Patel,
Yun Chen,
V. Prasad Chakka
|
|
Pages: 635-646 |
|
doi>10.1145/1007568.1007639 |
|
Full text: PDF
|
|
Moving object databases are required to support queries on a large number of continuously moving objects. A key requirement for indexing methods in this domain is to efficiently support both update and query operations. Previous work on indexing such ...
Moving object databases are required to support queries on a large number of continuously moving objects. A key requirement for indexing methods in this domain is to efficiently support both update and query operations. Previous work on indexing such databases can be broadly divided into two categories: indexing the past positions and indexing the future predicted positions. In this paper we focus on an efficient indexing method for the future positions of moving objects. We propose an indexing method, called STRIPES, which indexes predicted trajectories in a dual transformed space. Trajectories for objects in d-dimensional space become points in a higher-dimensional 2d-space. This dual transformed space is then indexed using a regular hierarchical grid decomposition indexing structure. STRIPES can evaluate a range of queries including time-slice, window, and moving queries. We have carried out extensive experimental evaluation comparing the performance of STRIPES with the best known existing predicted trajectory index (the TPR*-tree), and show that our approach is significantly faster than the TPR*-tree for both updates and search queries. expand
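The dual transform at the heart of the method can be sketched for one dimension as follows: an object whose predicted motion is linear becomes the point (velocity, reference position) in a 2-d dual space, and a time-slice range query becomes a linear constraint over that space. The hierarchical grid that STRIPES builds over the dual points is not reproduced here; the names and toy data are assumptions.

# Dual transform for 1-d linear motion and a time-slice range predicate
# evaluated on the dual points; the grid index itself is not modeled.
def to_dual(x_ref, velocity):
    return (velocity, x_ref)          # a point in the dual (v, x_ref) plane

def matches(dual_point, t_ref, t_q, lo, hi):
    """Does the object's predicted position at t_q fall inside [lo, hi]?"""
    v, x_ref = dual_point
    x_pred = x_ref + v * (t_q - t_ref)
    return lo <= x_pred <= hi

objects = {"o1": to_dual(0.0, 1.0), "o2": to_dual(5.0, -0.5), "o3": to_dual(9.0, 0.0)}
hits = [oid for oid, p in objects.items() if matches(p, t_ref=0.0, t_q=4.0, lo=2.0, hi=6.0)]
print(hits)   # ['o1', 'o2']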
|
|
|
SESSION: Research sessions: query optimization |
|
|
|
|
CORDS: automatic discovery of correlations and soft functional dependencies |
| |
Ihab F. Ilyas,
Volker Markl,
Peter Haas,
Paul Brown,
Ashraf Aboulnaga
|
|
Pages: 647-658 |
|
doi>10.1145/1007568.1007641 |
|
Full text: PDF
|
|
The rich dependency structure found in the columns of real-world relational databases can be exploited to great advantage, but can also cause query optimizers---which usually assume that columns are statistically independent---to underestimate the selectivities ...
The rich dependency structure found in the columns of real-world relational databases can be exploited to great advantage, but can also cause query optimizers---which usually assume that columns are statistically independent---to underestimate the selectivities of conjunctive predicates by orders of magnitude. We introduce CORDS, an efficient and scalable tool for automatic discovery of correlations and soft functional dependencies between columns. CORDS searches for column pairs that might have interesting and useful dependency relations by systematically enumerating candidate pairs and simultaneously pruning unpromising candidates using a flexible set of heuristics. A robust chi-squared analysis is applied to a sample of column values in order to identify correlations, and the number of distinct values in the sampled columns is analyzed to detect soft functional dependencies. CORDS can be used as a data mining tool, producing dependency graphs that are of intrinsic interest. We focus primarily on the use of CORDS in query optimization. Specifically, CORDS recommends groups of columns on which to maintain certain simple joint statistics. These "column-group" statistics are then used by the optimizer to avoid naive selectivity estimates based on inappropriate independence assumptions. This approach, because of its simplicity and judicious use of sampling, is relatively easy to implement in existing commercial systems, has very low overhead, and scales well to the large numbers of columns and large table sizes found in real-world databases. Experiments with a prototype implementation show that the use of CORDS in query optimization can speed up query execution times by an order of magnitude. CORDS can be used in tandem with query feedback systems such as the LEO learning optimizer, leveraging the infrastructure of such systems to correct bad selectivity estimates and ameliorating the poor performance of feedback systems during slow learning phases. expand
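A stripped-down version of the chi-squared check is sketched below: a contingency table is built from a sample of value pairs, and the statistic is compared against what independence would predict, with a large value relative to the degrees of freedom suggesting correlation. The bucketization of high-cardinality columns, sampling-size analysis, robustness adjustments, and the soft functional-dependency test of the actual tool are all omitted, and the sample data are invented.

# Chi-squared statistic over a sample of column-value pairs, testing the
# independence assumption a naive optimizer would make.
from collections import Counter

def chi_squared(pairs):
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    stat = 0.0
    for (a, b), observed in joint.items():
        expected = left[a] * right[b] / n
        stat += (observed - expected) ** 2 / expected
    dof = (len(left) - 1) * (len(right) - 1)
    return stat, dof

# "Make" and "Country" sampled together: strongly associated columns.
sample = [("Honda", "JP")] * 40 + [("Ford", "US")] * 35 + [("Honda", "US")] * 5
stat, dof = chi_squared(sample)
print(round(stat, 1), dof)   # a statistic far above dof suggests correlation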
|
|
|
Robust query processing through progressive optimization |
| |
Volker Markl,
Vijayshankar Raman,
David Simmen,
Guy Lohman,
Hamid Pirahesh,
Miso Cilimdzic
|
|
Pages: 659-670 |
|
doi>10.1145/1007568.1007642 |
|
Full text: PDF
|
|
Virtually every commercial query optimizer chooses the best plan for a query using a cost model that relies heavily on accurate cardinality estimation. Cardinality estimation errors can occur due to the use of inaccurate statistics, invalid assumptions ...
Virtually every commercial query optimizer chooses the best plan for a query using a cost model that relies heavily on accurate cardinality estimation. Cardinality estimation errors can occur due to the use of inaccurate statistics, invalid assumptions about attribute independence, parameter markers, and so on. Cardinality estimation errors may cause the optimizer to choose a sub-optimal plan. We present an approach to query processing that is extremely robust because it is able to detect and recover from cardinality estimation errors. We call this approach "progressive query optimization" (POP). POP validates cardinality estimates against actual values as measured during query execution. If there is significant disagreement between estimated and actual values, execution might be stopped and re-optimization might occur. Oscillation between optimization and execution steps can occur any number of times. A re-optimization step can exploit both the actual cardinality and partial results, computed during a previous execution step. Checkpoint operators (CHECK) validate the optimizer's cardinality estimates against actual cardinalities. Each CHECK has a condition that indicates the cardinality bounds within which a plan is valid. We compute this validity range through a novel sensitivity analysis of query plan operators. If the CHECK condition is violated, CHECK triggers re-optimization. POP has been prototyped in a leading commercial DBMS. An experimental evaluation of POP using TPC-H queries illustrates the robustness POP adds to query processing, while incurring only negligible overhead. A case-study applying POP to a real-world database and workload shows the potential of POP, accelerating complex OLAP queries by almost two orders of magnitude. expand
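The CHECK idea can be pictured with the small sketch below: a checkpoint carries the cardinality range within which the chosen plan remains valid, counts rows as they stream through, and signals re-optimization once the observed cardinality leaves that range. The validity range shown is a made-up stand-in for the output of the paper's sensitivity analysis, and the reuse of partial results is not modeled.

# Schematic CHECK operator: validate the optimizer's cardinality estimate
# while rows flow by, and request re-optimization when it is violated.
class ReoptimizationNeeded(Exception):
    pass

def check(rows, lo, hi):
    """Pass rows through while validating the estimated cardinality range."""
    seen = 0
    for row in rows:
        seen += 1
        if seen > hi:
            raise ReoptimizationNeeded(f"actual cardinality exceeds {hi}")
        yield row
    if seen < lo:
        raise ReoptimizationNeeded(f"actual cardinality {seen} below {lo}")

estimated_valid_range = (50, 200)     # hypothetical output of sensitivity analysis
scan = iter(range(1000))              # actual input turns out to be much larger
try:
    for row in check(scan, *estimated_valid_range):
        pass                          # downstream operators would consume rows here
except ReoptimizationNeeded as e:
    print("re-optimize:", e)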
|
|
|
Canonical abstraction for outerjoin optimization |
| |
Jun Rao,
Hamid Pirahesh,
Calisto Zuzarte
|
|
Pages: 671-682 |
|
doi>10.1145/1007568.1007643 |
|
Full text: PDF
|
|
Outerjoins are an important class of joins and are widely used in various kinds of applications. It is challenging to optimize queries that contain outerjoins because outerjoins do not always commute with inner joins. Previous work has studied this problem ...
Outerjoins are an important class of joins and are widely used in various kinds of applications. It is challenging to optimize queries that contain outerjoins because outerjoins do not always commute with inner joins. Previous work has studied this problem and provided techniques that allow certain reordering of the join sequences. However, the optimization of outerjoin queries is still not as powerful as that of inner joins.An inner join query can always be canonically represented as a sequence of Cartesian products of all relations, followed by a sequence of selection operations, each applying a conjunct in the join predicates. This canonical abstraction is very powerful because it enables the optimizer to use any join sequence for plan generation. Unfortunately, such a canonical abstraction for outerjoin queries has not been developed. As a result, existing techniques always exclude certain join sequences from planning, which can lead to a severe performance penalty.Given a query consisting of a sequence of inner and outer joins, we, for the first time, present a canonical abstraction based on three operations: outer Cartesian products, nullification, and best match. Like the inner join abstraction, our outerjoin abstraction permits all join sequences, and preserves the property of both commutativity and transitivity among predicates. This allows us to generate plans that are very desirable for performance reasons but that couldn't be done before. We present an algorithm that produces such a canonical abstraction, and a method that extends an inner-join optimizer to generate plans in an expanded search space. We also describe an efficient implementation of the best match operation using the OLAP functionalities in SQL:1999. Our experimental results show that our technique can significantly improve the performance of outerjoin queries. expand
|
|
|
SESSION: Research sessions: spatial data |
|
|
|
|
Joining interval data in relational databases |
| |
Jost Enderle,
Matthias Hampel,
Thomas Seidl
|
|
Pages: 683-694 |
|
doi>10.1145/1007568.1007645 |
|
Full text: PDF
|
|
The increasing use of temporal and spatial data in present-day relational systems necessitates an efficient support of joins on interval-valued attributes. Standard join algorithms do not support those data types adequately, whereas special approaches ...
The increasing use of temporal and spatial data in present-day relational systems necessitates an efficient support of joins on interval-valued attributes. Standard join algorithms do not support those data types adequately, whereas special approaches for interval joins usually require an augmentation of the internal access methods which is not supported by existing relational systems. To overcome these problems we introduce new join algorithms for interval data. Based on the Relational Interval Tree, these algorithms can easily be implemented on top of any relational database system while providing excellent performance on joining intervals. As experimental results on an Oracle9i server show, the new techniques outperform existing relational methods for joining intervals significantly. expand
|
|
|
Approximation techniques for spatial data |
| |
Abhinandan Das,
Johannes Gehrke,
Mirek Riedewald
|
|
Pages: 695-706 |
|
doi>10.1145/1007568.1007646 |
|
Full text: PDF
|
|
Spatial Database Management Systems (SDBMS), e.g., Geographical Information Systems, that manage spatial objects such as points, lines, and hyper-rectangles, often have very high query processing costs. Accurate selectivity estimation during query optimization ...
Spatial Database Management Systems (SDBMS), e.g., Geographical Information Systems, that manage spatial objects such as points, lines, and hyper-rectangles, often have very high query processing costs. Accurate selectivity estimation during query optimization therefore is crucially important for finding good query plans, especially when spatial joins are involved. Selectivity estimation has been studied for relational database systems, but to date has only received little attention in SDBMS. In this paper, we introduce novel methods that permit high-quality selectivity estimation for spatial joins and range queries. Our techniques can be constructed in a single scan over the input, handle inserts and deletes to the database incrementally, and hence they can also be used for processing of streaming spatial data. In contrast to previous approaches, our techniques return approximate results that come with provable probabilistic quality guarantees. We present a detailed analysis and experimentally demonstrate the efficacy of the proposed techniques. expand
|
|
|
Spatially-decaying aggregation over a network: model and algorithms |
| |
Edith Cohen,
Haim Kaplan
|
|
Pages: 707-718 |
|
doi>10.1145/1007568.1007647 |
|
Full text: PDF
|
|
Data items are often associated with a location in which they are present or collected, and their relevance or influence decays with their distance. Aggregate values over such data thus depend on the observing location, where the weight given to each ...
Data items are often associated with a location in which they are present or collected, and their relevance or influence decays with their distance. Aggregate values over such data thus depend on the observing location, where the weight given to each item depends on its distance from that location. We term such aggregation spatially-decaying. Spatially-decaying aggregation has numerous applications: Individual sensor nodes collect readings of an environmental parameter such as contamination level or parking spot availability; the nodes then communicate to integrate their readings so that each location obtains contamination level or parking availability in its neighborhood. Nodes in a p2p network could use a summary of content and properties of nodes in their neighborhood in order to guide search. In graph databases such as the Web hyperlink structure, properties such as the subjects of pages that can reach or be reached from a page using link traversals provide information on the page. We formalize the notion of spatially-decaying aggregation and develop efficient algorithms for fundamental aggregation functions, including sums and averages, random sampling, heavy hitters, quantiles, and Lp norms. expand
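A brute-force rendering of the semantics might look like the sketch below: each reading is weighted by a decay kernel of its distance from the observing location before summation. The exponential kernel, the distance table, and the function names are illustrative assumptions; the paper's contribution is computing such aggregates efficiently at every node with approximation guarantees, not this direct evaluation.

# Direct, non-distributed spatially-decaying sum: weight each reading by a
# kernel of its distance from the observer. Only pins down the semantics.
import math

def decayed_sum(readings, distances, observer, alpha=0.5):
    """readings: {node: value}; distances: {(observer, node): distance}."""
    return sum(value * math.exp(-alpha * distances[(observer, node)])
               for node, value in readings.items())

readings = {"s1": 10.0, "s2": 4.0, "s3": 7.0}
dists = {("me", "s1"): 0.0, ("me", "s2"): 2.0, ("me", "s3"): 5.0}
print(round(decayed_sum(readings, dists, "me"), 3))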
|
|
|
SESSION: Research sessions: schema discovery |
|
|
|
|
TOSS: an extension of TAX with ontologies and similarity queries |
| |
Edward Hung,
Yu Deng,
V. S. Subrahmanian
|
|
Pages: 719-730 |
|
doi>10.1145/1007568.1007649 |
|
Full text: PDF
|
|
TAX is perhaps the best known extension of the relational algebra to handle queries to XML databases. One problem with TAX (as with many existing relational DBMSs) is that the semantics of terms in a TAX DB are not taken into account when answering queries. ...
TAX is perhaps the best known extension of the relational algebra to handle queries to XML databases. One problem with TAX (as with many existing relational DBMSs) is that the semantics of terms in a TAX DB are not taken into account when answering queries. Thus, even though TAX answers queries with 100% precision, the recall of TAX is relatively low. Our TOSS system improves the recall of TAX via the concept of a similarity enhanced ontology (SEO). Intuitively, an ontology is a set of graphs describing relationships (such as isa, partof, etc.) between terms in a DB. An SEO also evaluates how similarities between terms (e.g. "J. Ullman", "Jeff Ullman", and "Jeffrey Ullman") affect ontologies. Finally, we show how the algebra proposed in TAX can be extended to take SEOs into account. The result is a system that provides a much higher answer quality than TAX does alone (quality is defined as the square root of the product of precision and recall). We experimentally evaluate the TOSS system on the DBLP and SIGMOD bibliographic databases and show that TOSS has acceptable performance. expand
|
|
|
Information-theoretic tools for mining database structure from large data sets |
| |
Periklis Andritsos,
Renée J. Miller,
Panayiotis Tsaparas
|
|
Pages: 731-742 |
|
doi>10.1145/1007568.1007650 |
|
Full text: PDF
|
|
Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed ...
Data design has been characterized as a process of arriving at a design that maximizes the information content of each piece of data (or equivalently, one that minimizes redundancy). Information content (or redundancy) is measured with respect to a prescribed model for the data, a model that is often expressed as a set of constraints. In this work, we consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete. Specifically, we consider the problem of finding structural clues in an instance of data, an instance which may contain errors, missing values, and duplicate records. We propose a set of information-theoretic tools for finding structural summaries that are useful in characterizing the information content of the data, and ultimately useful in data design. We provide algorithms for creating these summaries over large, categorical data sets. We study the use of these summaries in one specific physical design task, that of ranking functional dependencies based on their data redundancy. We show how our ranking can be used by a physical data-design tool to find good vertical decompositions of a relation (decompositions that improve the information content of the design). We present an evaluation of the approach on real data sets. expand
|
|
|
SESSION: Research sessions: query uncertainty |
|
|
|
|
Efficient set joins on similarity predicates |
| |
Sunita Sarawagi,
Alok Kirpal
|
|
Pages: 743-754 |
|
doi>10.1145/1007568.1007652 |
|
Full text: PDF
|
|
In this paper we present an efficient, scalable and general algorithm for performing set joins on predicates involving various similarity measures like intersect size, Jaccard-coefficient, cosine similarity, and edit-distance. This expands the existing suite of algorithms for set joins on simpler predicates such as set containment, equality, and non-zero overlap. We start with a basic inverted index based probing method and add a sequence of optimizations that result in one to two orders of magnitude improvement in running time. The algorithm folds in a data partitioning strategy that can work efficiently with an index compressed to fit in any available amount of main memory. The optimizations used in our algorithm generalize to several weighted and unweighted measures of partial word overlap between sets.
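For illustration, a small Python sketch of the basic inverted-index probing step for a Jaccard-coefficient join predicate is given below; it omits the paper's optimizations and partitioning strategy, and the data is made up.

```python
from collections import defaultdict

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def set_similarity_join(left, right, threshold):
    """Join two collections of (id, token-set) pairs on Jaccard similarity >= threshold,
    probing an inverted index on the right side to generate candidate pairs."""
    index = defaultdict(set)                     # token -> ids of right sets containing it
    for rid, tokens in right.items():
        for t in tokens:
            index[t].add(rid)
    results = []
    for lid, tokens in left.items():
        candidates = set()
        for t in tokens:                         # only sets sharing at least one token qualify
            candidates |= index[t]
        for rid in candidates:
            sim = jaccard(tokens, right[rid])
            if sim >= threshold:
                results.append((lid, rid, round(sim, 2)))
    return results

left = {1: {"jeff", "ullman"}, 2: {"db", "systems"}}
right = {10: {"j", "ullman"}, 20: {"database", "systems"}}
print(set_similarity_join(left, right, threshold=0.3))
```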
|
|
|
Automatic categorization of query results |
| |
Kaushik Chakrabarti,
Surajit Chaudhuri,
Seung-won Hwang
|
|
Pages: 755-766 |
|
doi>10.1145/1007568.1007653 |
|
Full text: PDF
|
|
Exploratory ad-hoc queries could return too many answers - a phenomenon commonly referred to as "information overload". In this paper, we propose to automatically categorize the results of SQL queries to address this problem. We dynamically generate a labeled, hierarchical category structure: users can determine whether a category is relevant simply by examining its label; they can then explore just the relevant categories and ignore the remaining ones, thereby reducing information overload. We first develop analytical models to estimate the information overload faced by a user for a given exploration. Based on those models, we formulate the categorization problem as a cost optimization problem and develop heuristic algorithms to compute the min-cost categorization.
|
|
|
SESSION: Research sessions: text and DB |
|
|
|
|
When one sample is not enough: improving text database selection using shrinkage |
| |
Panagiotis G. Ipeirotis,
Luis Gravano
|
|
Pages: 767-778 |
|
doi>10.1145/1007568.1007655 |
|
Full text: PDF
|
|
Database selection is an important step when searching over large numbers of distributed text databases. The database selection task relies on statistical summaries of the database contents, which are not typically exported by databases. Previous research has developed algorithms for constructing an approximate content summary of a text database from a small document sample extracted via querying. Unfortunately, Zipf's law practically guarantees that content summaries built this way for any relatively large database will fail to cover many low-frequency words. Incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To improve the coverage of approximate content summaries, we build on the observation that topically similar databases tend to have related vocabularies. Therefore, the approximate content summaries of topically related databases can complement each other and increase their coverage. Specifically, we exploit a (given or derived) hierarchical categorization of the databases and adapt the notion of "shrinkage" (a form of smoothing that has been used successfully for document classification) to the content summary construction task. A thorough evaluation over 315 real web databases as well as over TREC data suggests that the shrinkage-based content summaries are substantially more complete than their "unshrunk" counterparts. We also describe how to modify existing database selection algorithms to adaptively decide, at run time, whether to apply shrinkage for a query. Our experiments, which rely on TREC data sets, queries, and the associated "relevance judgments", show that our shrinkage-based approach significantly improves state-of-the-art database selection algorithms, and also outperforms a recently proposed hierarchical strategy that exploits database classification as well.
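As a toy illustration of the shrinkage idea (not the paper's estimation procedure), the sketch below smooths a database's sampled word distribution with its category's distribution, so words absent from the sample still receive non-zero weight; the mixing weight and the data are hypothetical.

```python
def shrink(db_summary, category_summary, lam=0.7):
    """Shrinkage-style smoothing: p_shrunk(w) = lam * p_db(w) + (1 - lam) * p_cat(w)."""
    vocab = set(db_summary) | set(category_summary)
    return {w: lam * db_summary.get(w, 0.0) + (1 - lam) * category_summary.get(w, 0.0)
            for w in vocab}

# Hypothetical sample-based summary of one health database (word -> probability).
db = {"cancer": 0.6, "treatment": 0.4}            # low-frequency words missing from the sample
category = {"cancer": 0.3, "treatment": 0.3, "metastasis": 0.2, "oncology": 0.2}

shrunk = shrink(db, category)
print(shrunk["metastasis"])  # > 0 even though the word never appeared in the sample
```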
|
|
|
On the integration of structure indexes and inverted lists |
| |
Raghav Kaushik,
Rajasekar Krishnamurthy,
Jeffrey F. Naughton,
Raghu Ramakrishnan
|
|
Pages: 779-790 |
|
doi>10.1145/1007568.1007656 |
|
Full text: PDF
|
|
Several methods have been proposed to evaluate queries over a native XML DBMS, where the queries specify both path and keyword constraints. These broadly consist of graph traversal approaches, optimized with auxiliary structures known as structure indexes; and approaches based on information-retrieval style inverted lists. We propose a strategy that combines the two forms of auxiliary indexes, and a query evaluation algorithm for branching path expressions based on this strategy. Our technique is general and applicable for a wide range of choices of structure indexes and inverted list join algorithms. Our experiments over the Niagara XML DBMS show the benefit of integrating the two forms of indexes. We also consider algorithmic issues in evaluating path expression queries when the notion of relevance ranking is incorporated. By integrating the above techniques with the Threshold Algorithm proposed by Fagin et al., we obtain instance optimal algorithms to push down top k computation. expand
|
|
|
SESSION: Research sessions: query progress |
|
|
|
|
Toward a progress indicator for database queries |
| |
Gang Luo,
Jeffrey F. Naughton,
Curt J. Ellmann,
Michael W. Watzke
|
|
Pages: 791-802 |
|
doi>10.1145/1007568.1007658 |
|
Full text: PDF
|
|
Many modern software systems provide progress indicators for long-running tasks. These progress indicators make systems more user-friendly by helping the user quickly estimate how much of the task has been completed and when the task will finish. However, none of the existing commercial RDBMSs provides a non-trivial progress indicator for long-running queries. In this paper, we consider the problem of supporting such progress indicators. After discussing the goals and challenges inherent in this problem, we present a set of techniques sufficient for implementing a simple yet useful progress indicator for a large subset of RDBMS queries. We report an initial implementation of these techniques in PostgreSQL.
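A minimal sketch of the kind of bookkeeping a progress indicator might perform is shown below: progress is the fraction of an estimated total amount of work already consumed, with the estimate corrected at run time. The work model (tuples processed) is a deliberate simplification and not the paper's technique.

```python
class ProgressIndicator:
    """Toy progress indicator: tracks processed work units against an initial,
    optimizer-style estimate of total work that can be revised during execution."""

    def __init__(self, estimated_total_tuples):
        self.estimated_total = estimated_total_tuples
        self.processed = 0

    def tuples_processed(self, n):
        self.processed += n
        # If the initial estimate proves too low, grow it so progress never exceeds 100%.
        self.estimated_total = max(self.estimated_total, self.processed)

    def percent_done(self):
        return 100.0 * self.processed / self.estimated_total

pi = ProgressIndicator(estimated_total_tuples=1_000_000)
pi.tuples_processed(250_000)
print(f"{pi.percent_done():.0f}% done")  # 25% under the current estimate
```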
|
|
|
Estimating progress of execution for SQL queries |
| |
Surajit Chaudhuri,
Vivek Narasayya,
Ravishankar Ramamurthy
|
|
Pages: 803-814 |
|
doi>10.1145/1007568.1007659 |
|
Full text: PDF
|
|
Today's database systems provide little feedback to the user/DBA on how much of a SQL query's execution has been completed. For long running queries, such feedback can be very useful, for example, to help decide whether the query should be terminated or allowed to run to completion. Although the above requirement is easy to express, developing a robust indicator of progress for query execution is challenging. In this paper, we study the above problem and present techniques that can form the basis for effective progress estimation. The results of experimentally validating our techniques in Microsoft SQL Server are promising. expand
|
|
|
SESSION: Research sessions: consistency and availability |
|
|
|
|
Relaxed currency and consistency: how to say "good enough" in SQL |
| |
Hongfei Guo,
Per-Åke Larson,
Raghu Ramakrishnan,
Jonathan Goldstein
|
|
Pages: 815-826 |
|
doi>10.1145/1007568.1007661 |
|
Full text: PDF
|
|
Despite the widespread and growing use of asynchronous copies to improve scalability, performance and availability, this practice still lacks a firm semantic foundation. Applications are written with some understanding of which queries can use data that is not entirely current and which copies are "good enough"; however, there are neither explicit requirements nor guarantees. We propose to make this knowledge available to the DBMS through explicit currency and consistency (C&C) constraints in queries and develop techniques so the DBMS can guarantee that the constraints are satisfied. In this paper we describe our model for expressing C&C constraints, define their semantics, and propose SQL syntax. We explain how C&C constraints are enforced in MTCache, our prototype mid-tier database cache, including how constraints and replica update policies are elegantly integrated into the cost-based query optimizer. Consistency constraints are enforced at compile time while currency constraints are enforced at run time by dynamic plans that check the currency of each local replica before use and select sub-plans accordingly. This approach makes optimal use of the cache DBMS while at the same time guaranteeing that applications always get data that is "good enough" for their purpose. expand
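The paper's SQL syntax is not reproduced here; as a rough sketch of the run-time currency check the abstract mentions, the snippet below picks a local replica only if its staleness satisfies the query's currency bound, falling back to the backend otherwise. All names and numbers are illustrative.

```python
import time

def choose_source(replica_last_refresh, currency_bound_sec, now=None):
    """Return 'replica' if the cached copy is fresh enough for the query's currency
    constraint, otherwise 'backend' (the authoritative database)."""
    now = time.time() if now is None else now
    staleness = now - replica_last_refresh
    return "replica" if staleness <= currency_bound_sec else "backend"

# The query tolerates data up to 10 minutes old; the replica was refreshed 3 minutes ago.
print(choose_source(replica_last_refresh=time.time() - 180, currency_bound_sec=600))
```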
|
|
|
Highly available, fault-tolerant, parallel dataflows |
| |
Mehul A. Shah,
Joseph M. Hellerstein,
Eric Brewer
|
|
Pages: 827-838 |
|
doi>10.1145/1007568.1007662 |
|
Full text: PDF
|
|
We present a technique that masks failures in a cluster to provide high availability and fault-tolerance for long-running, parallelized dataflows. We can use these dataflows to implement a variety of continuous query (CQ) applications that require high-throughput, 24x7 operation. Examples include network monitoring, phone call processing, click-stream processing, and online financial analysis. Our main contribution is a scheme that carefully integrates traditional query processing techniques for partitioned parallelism with the process-pairs approach for high availability. This delicate integration allows us to tolerate failures of portions of a parallel dataflow without sacrificing result quality. Upon failure, our technique provides quick fail-over, and automatically recovers the lost pieces on the fly. This piecemeal recovery provides minimal disruption to the ongoing dataflow computation and improved reliability as compared to the straight-forward application of the process-pairs technique on a per dataflow basis. Thus, our technique provides the high availability necessary for critical CQ applications. Our techniques are encapsulated in a reusable dataflow operator called Flux, an extension of the Exchange that is used to compose parallel dataflows. Encapsulating the fault-tolerance logic into Flux minimizes modifications to existing operator code and relieves the burden on the operator writer of repeatedly implementing and verifying this critical logic. We present experiments illustrating these features with an implementation of Flux in the TelegraphCQ code base [8]. expand
|
|
|
SESSION: Industrial sessions: database internals - I |
|
|
|
|
Query sampling in DB2 Universal Database |
| |
Jarek Gryz,
Junjie Guo,
Linqi Liu,
Calisto Zuzarte
|
|
Pages: 839-843 |
|
doi>10.1145/1007568.1007664 |
|
Full text: PDF
|
|
Executing ad hoc queries against large databases can be prohibitively expensive. Exploratory analysis of data may not require exact answers to queries, however: results based on sampling the data are often satisfactory. Supporting sampling as a primitive SQL operator turns out to be difficult because sampling does not commute with many SQL operators. In this paper, we describe an implementation in IBM® DB2® Universal Database (UDB) of a sampling operator that commutes with some SQL operators. As a result, a query with the sampling operator always returns a random sample of the answers and in many cases runs faster than it would have without such an operator.
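As a rough illustration of why commuting sampling with other operators pays off (this is not DB2's operator), the sketch below Bernoulli-samples rows before and after a selection: either order yields a random sample of the answer, but sampling first lets far fewer rows reach the rest of the plan.

```python
import random

def bernoulli_sample(rows, p, seed=42):
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < p]

rows = [{"id": i, "amount": i % 100} for i in range(10_000)]
predicate = lambda r: r["amount"] > 90

# Sample after the selection: a 10% sample of the exact answer ...
answer_then_sample = bernoulli_sample([r for r in rows if predicate(r)], p=0.1)

# ... versus sample first, then filter: still a 10% Bernoulli sample of the answer,
# but the predicate now runs on only about 10% of the input.
sample_then_filter = [r for r in bernoulli_sample(rows, p=0.1) if predicate(r)]

print(len(answer_then_sample), len(sample_then_filter))  # both close to 0.1 * |answer|
```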
|
|
|
Query processing for SQL updates |
| |
César A. Galindo-Legaria,
Stefano Stefani,
Florian Waas
|
|
Pages: 844-849 |
|
doi>10.1145/1007568.1007665 |
|
Full text: PDF
|
|
A rich set of concepts and techniques has been developed in the context of query processing for the efficient and robust execution of queries. So far, this work has mostly focused on issues related to data-retrieval queries, with a strong backing on relational algebra. However, update operations can also exhibit a number of query processing issues, depending on the complexity of the operations and the volume of data to process. Such issues include lookup and matching of values, navigational vs. set-oriented algorithms, and trade-offs between plans that do serial or random I/Os. In this paper we present an overview of the basic techniques used to support SQL DML (Data Manipulation Language) in Microsoft SQL Server. Our focus is on the integration of update operations into the query processor, the query execution primitives required to support updates, and the update-specific considerations to analyze and execute update plans. Full integration of update processing in the query processor provides a robust and flexible framework and leverages existing query processing techniques.
|
|
|
Parallel SQL execution in Oracle 10g |
| |
Thierry Cruanes,
Benoit Dageville,
Bhaskar Ghosh
|
|
Pages: 850-854 |
|
doi>10.1145/1007568.1007666 |
|
Full text: PDF
|
|
This paper describes the new architecture and optimizations for parallel SQL execution in the Oracle 10g database. Based on the fundamental shared-disk architecture underpinning Oracle's parallel SQL execution engine since Oracle7, we show in this paper how Oracle's engine responds to the challenges of performing in new grid-computing environments. This is made possible by using advanced optimization techniques, which enable Oracle to exploit data and system architecture dynamically without being constrained by them. We show how we have evolved and re-architected our engine in Oracle 10g to make it more efficient and manageable by using a single global parallel plan model. expand
|
|
|
SESSION: Industrial sessions: database internals - II |
|
|
|
|
Data densification in a relational database system |
| |
Abhinav Gupta,
Sankar Subramanian,
Srikanth Bellamkonda,
Tolga Bozkaya,
Nathan Folkert,
Lei Sheng,
Andrew Witkowski
|
|
Pages: 855-859 |
|
doi>10.1145/1007568.1007668 |
|
Full text: PDF
|
|
Data in a relational data warehouse is usually sparse. That is, if no value exists for a given combination of dimension values, no row exists in the fact table. Densities of 0.1-2% are very common. However, users may want to view the data in a dense form, with rows for all combinations of dimension values displayed even when no fact data exists for them. For example, if a product did not sell during a particular time period, users may still want to see the product for that time period with a zero sales value next to it. Moreover, analytic window functions [1] and the SQL model clause [2] can more easily express time series calculations if data is dense along the time dimension, because dense data will fill a consistent number of rows for each period. Data densification is the process of converting sparse data into dense form. The current SQL technique for densification (using a combination of DISTINCT, CROSS JOIN and OUTER JOIN operations) is extremely unintuitive, difficult to express and inefficient to compute. Hence, we propose an extension to the ANSI SQL join operator, referred to as "PARTITIONED OUTER JOIN", which allows for a succinct expression of densification along the dimensions of interest. We also present various algorithms to evaluate the new join operator efficiently and compare it with existing methods of doing the equivalent operation. We also define a new window function "LAST_VALUE (IGNORE NULLS)" which is very useful with partitioned outer join.
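The proposed PARTITIONED OUTER JOIN is a SQL extension; to make its semantics concrete, the Python sketch below emulates the effect: within each partition (here, each product) the sparse fact rows are outer-joined against the full time dimension, filling missing periods with zero sales. Table contents are hypothetical.

```python
from collections import defaultdict

def densify(facts, time_dim):
    """Emulate a partitioned outer join: for every product (the partition) emit one row
    per time period, carrying the fact value when present and 0 otherwise."""
    by_product = defaultdict(dict)
    for product, period, sales in facts:
        by_product[product][period] = sales
    dense = []
    for product, periods in by_product.items():
        for period in time_dim:                  # every period appears in every partition
            dense.append((product, period, periods.get(period, 0)))
    return dense

facts = [("tv", "2004-01", 5), ("tv", "2004-03", 2), ("radio", "2004-02", 7)]  # sparse
time_dim = ["2004-01", "2004-02", "2004-03"]
for row in densify(facts, time_dim):
    print(row)
```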
|
|
|
Hosting the .NET Runtime in Microsoft SQL server |
| |
Alazel Acheson,
Mason Bendixen,
José A. Blakeley,
Peter Carlin,
Ebru Ersan,
Jun Fang,
Xiaowei Jiang,
Christian Kleinerman,
Balaji Rathakrishnan,
Gideon Schaller,
Beysim Sezgin,
Ramachandran Venkatesh,
Honggang Zhang
|
|
Pages: 860-865 |
|
doi>10.1145/1007568.1007669 |
|
Full text: PDF
|
|
The integration of the .NET Common Language Runtime (CLR) inside the SQL Server DBMS enables database programmers to write business logic in the form of functions, stored procedures, triggers, data types, and aggregates using modern programming languages such as C#, Visual Basic, C++, COBOL, and J++. This paper presents three main aspects of this work. First, it describes the architecture of the integration of the CLR inside the SQL Server database process to provide a safe, scalable, secure, and efficient environment to run user code. Second, it describes our approach to defining and enforcing extensibility contracts to allow a tight integration of types, aggregates, functions, triggers, and procedures written in modern languages with the DBMS. Finally, it presents initial performance results showing the efficiency of user-defined types and functions relative to equivalent native DBMS features. expand
|
|
|
Vertical and horizontal percentage aggregations |
| |
Carlos Ordonez
|
|
Pages: 866-871 |
|
doi>10.1145/1007568.1007670 |
|
Full text: PDF
|
|
Existing SQL aggregate functions present important limitations to compute percentages. This article proposes two SQL aggregate functions to compute percentages addressing such limitations. The first function returns one row for each percentage in vertical form like standard SQL aggregations. The second function returns each set of percentages adding 100% on the same row in horizontal form. These novel aggregate functions are used as a framework to introduce the concept of percentage queries and to generate efficient SQL code. Experiments study different percentage query optimization strategies and compare evaluation time of percentage queries taking advantage of our proposed aggregations against queries using available OLAP extensions. The proposed percentage aggregations are easy to use, have wide applicability and can be efficiently evaluated. expand
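A plain-Python sketch of the two result shapes described above follows; it is only meant to pin down the vertical and horizontal forms, and the sample data and column names are made up.

```python
from collections import defaultdict

rows = [  # (state, product, sales) -- hypothetical fact data
    ("CA", "tv", 30), ("CA", "radio", 70),
    ("TX", "tv", 50), ("TX", "radio", 50),
]

totals = defaultdict(float)
for state, product, sales in rows:
    totals[state] += sales

# Vertical form: one output row per percentage, like a standard SQL aggregate.
vertical = [(state, product, 100.0 * sales / totals[state]) for state, product, sales in rows]

# Horizontal form: each group's percentages on a single row, adding up to 100%.
horizontal = defaultdict(dict)
for state, product, pct in vertical:
    horizontal[state][product] = pct

print(vertical)
print(dict(horizontal))  # {'CA': {'tv': 30.0, 'radio': 70.0}, 'TX': {'tv': 50.0, 'radio': 50.0}}
```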
|
|
|
SESSION: Industrial sessions: Web Services |
|
|
|
|
Models for Web Services transactions |
| |
Mark Little
|
|
Pages: 872-872 |
|
doi>10.1145/1007568.1007672 |
|
Full text: PDF
|
|
|
|
|
Enabling sovereign information sharing using Web Services |
| |
Rakesh Agrawal,
Dmitri Asonov,
Ramakrishnan Srikant
|
|
Pages: 873-877 |
|
doi>10.1145/1007568.1007673 |
|
Full text: PDF
|
|
Sovereign information sharing allows autonomous entities to compute queries across their databases in such a way that nothing apart from the result is revealed. We describe an implementation of this model using web services infrastructure. Each site participating in sovereign sharing offers a data service that allows database operations to be applied on the tables they own. Of particular interest is the provision for binary operations such as relational joins. Applications are developed by combining these data services. We present performance measurements that show the promise of a new breed of practical applications based on the paradigm of sovereign information integration. expand
|
|
|
Building dynamic application networks with Web Services |
| |
Matthew Mihic
|
|
Pages: 878-878 |
|
doi>10.1145/1007568.1007674 |
|
Full text: PDF
|
|
Looking at the state of the industry today, it is clear that we are in the early stages of Web Services development. Companies are still evaluating the technology and considering how to apply it to their business. But over the past year, we seem to have reached an inflection point, with companies building real systems based on Web Services. Partly this reflects an acceptance that the basic Web Services technologies - XML Schema [1][2], SOAP [3], WSDL [4] - have matured to the point where they can be used for mission-critical applications. But it also reflects a growing understanding that Web Services enable a large class of systems that were previously very difficult to build. These systems are characterized by several critical properties:
1. Rapid rates of change. The time is long past when companies could afford a year-long effort to build out a new application. Businesses move at a faster pace today than ever before, and they are increasingly under pressure to do more work with fewer resources. This places a premium on the ability to build applications by quickly composing pre-existing services. The result is that systems are being connected in ways that were never imagined during development. This is reuse in the large - not just small services, but entire applications being linked together to solve a complex business function.
2. Significant availability and scalability requirements. Many of these systems are "bet-your-business" types of applications. They have heavy scalability and availability requirements. Often they need to connect multiple partners and service hundreds of thousands of updates in a day, without ever suffering an interruption in service.
3. Heterogeneous development tools and software platforms. Each of these applications typically involves components built using a wildly diverse set of tools, operating systems, and software platforms. Partly this is a result of building systems out of existing components - many of these components are locked into certain environments, and there are no resources to rewrite or migrate to a single homogeneous platform. But it is also recognition that different problems are best solved by different toolsets. Some problems are best solved by writing code on an application server, others are best suited for scripting, and still others are solved by customizing an existing enterprise application. Heterogeneity is not going away. It is only increasing.
4. Multiple domains of administrative control. An aspect of heterogeneity that is often overlooked is distributed ownership. As businesses merge, acquire, and partner with other companies, there is an increasing need to build applications that span organizational boundaries.
These characteristics present a unique set of challenges to the way we think about developing, describing, connecting, and configuring applications. They require us to develop new ways of looking at what it takes to build an application, and what makes up a network. In this session, we examine the nature of this next generation of application and discuss the way in which Web Services are evolving to meet their needs. The session focuses on the development techniques that allow services to be easily and dynamically composed into rich applications, and considers the capabilities required of the underlying network fabric. The session concludes with an in-depth look at some of the critical Web Services specifications actively under development by industry leaders.
|
|
|
Secure, reliable, transacted: innovation in Web Services architecture |
| |
Martin Gudgin
|
|
Pages: 879-880 |
|
doi>10.1145/1007568.1007675 |
|
Full text: PDF
|
|
This paper discusses the design of Web Services Protocols paying special attention to composition of such protocols. The transaction related protocols are discussed as exemplars.
|
|
|
SESSION: Industrial sessions: database applications |
|
|
|
|
SoundCompass: a practical query-by-humming system; normalization of scalable and shiftable time-series data and effective subsequence generation |
| |
Naoko Kosugi,
Yasushi Sakurai,
Masashi Morimoto
|
|
Pages: 881-886 |
|
doi>10.1145/1007568.1007677 |
|
Full text: PDF
|
|
This paper describes our practical query-by-humming system, SoundCompass, which is being used as a karaoke song selection system in Japan. First, we describe the fundamental techniques employed by SoundCompass, such as time-wise normalization of music data, time-scalable and tone-shiftable time-series data, and the generation of subsequences for efficient matching. Second, we describe techniques for building effective feature vectors from real music data and matching against them to achieve accurate query-by-humming. Third, we share valuable knowledge obtained through months of practical use of SoundCompass. Fourth, we describe the latest version of the SoundCompass system that incorporates these new techniques and knowledge, as well as quantitative evaluations that demonstrate the practicality of SoundCompass. The new system provides flexible and accurate similarity retrieval based on k-nearest neighbor searches with multi-dimensional spatial indices built over multi-dimensional feature vectors.
|
|
|
Model-driven business UI based on maps |
| |
Per Bendsen
|
|
Pages: 887-891 |
|
doi>10.1145/1007568.1007678 |
|
Full text: PDF
|
|
Future business applications will often have more than 2,000 forms and need to target several user interface (UI) technologies, including Web browsers, Windows® applications, PDAs, and cell phones. These applications will need state-of-the-art layout combined with excellent usability, with specially built forms that handle specific tasks based on user roles. How can the trade-off between developer productivity and user experience be handled? The technologies being implemented in Microsoft® Business Framework include a model-driven business UI platform that exploits flexible maps and a layered form definition. The framework generates forms based on a model of the business logic, which is an integrated part of the business framework. The generation process uses declarative and changeable maps so that it can be controlled and modified by the business developer.
|
|
|
dbSwitch™: towards a database utility |
| |
Shaul Dar,
Gil Hecht,
Eden Shochat
|
|
Pages: 892-896 |
|
doi>10.1145/1007568.1007679 |
|
Full text: PDF
|
|
Savantis Systems' dbSwitch™ is an innovative commercial product providing database server virtualization and advancing a database utility model. The dbSwitch enables a new architecture, called a Database Area Network (DAN), which pools database server resources and shares them among multiple database applications. Specific benefits of the DAN architecture for enterprise data centers include server consolidation, improved utilization, high availability and capacity management. We describe the major components of the dbSwitch, namely routing of application requests to database instances, optimization of database server resources and capacity visualization and manipulation. We also relate dbSwitch to recent work on utility and grid computing. expand
|
|
|
SESSION: Industrial sessions: information assurance challenges |
|
|
|
|
Requirements and policy challenges in highly secure environments |
| |
Dean E. Hall
|
|
Pages: 897-898 |
|
doi>10.1145/1007568.1007681 |
|
Full text: PDF
|
|
|
|
|
Information assurance technical challenges |
| |
Nicholas J. Multari
|
|
Pages: 899-899 |
|
doi>10.1145/1007568.1007682 |
|
Full text: PDF
|
|
|
|
|
Service-oriented BI: towards tight integration of business intelligence into operational applications |
| |
Marcus Dill,
Achim Kraiss,
Stefan Sigg,
Thomas Zurek
|
|
Pages: 900-900 |
|
doi>10.1145/1007568.1007683 |
|
Full text: PDF
|
|
|
|
|
SESSION: Industrial sessions: the marriage of XML and relational databases |
|
|
|
|
XML in the middle: XQuery in the WebLogic Platform |
| |
Michael J. Carey
|
|
Pages: 901-902 |
|
doi>10.1145/1007568.1007685 |
|
Full text: PDF
|
|
The BEA WebLogic Platform product suite consists of WebLogic Server, WebLogic Workshop, WebLogic Integration, WebLogic Portal, and Liquid Data for WebLogic. W3C standards including XML, XML Schema, and the emerging XML query language XQuery play important roles in several of these products. This industrial presentation will discuss the increasingly central role of XML in the middle tier of enterprise IT architectures and cover some of the key XML technologies that the BEA WebLogic Platform provides for creating enterprise applications in today's IT world. We focus in particular on how XQuery fits into this picture, both for today's WebLogic Platform 8.1 and going forward in terms of the Platform roadmap. expand
|
|
|
ORDPATHs: insert-friendly XML node labels |
| |
Patrick O'Neil,
Elizabeth O'Neil,
Shankar Pal,
Istvan Cseri,
Gideon Schaller,
Nigel Westbury
|
|
Pages: 903-908 |
|
doi>10.1145/1007568.1007686 |
|
Full text: PDF
|
|
We introduce a hierarchical labeling scheme called ORDPATH that is implemented in the upcoming version of Microsoft® SQL Server™. ORDPATH labels nodes of an XML tree without requiring a schema (the most general case---a schema simplifies the problem). An example of an ORDPATH value display format is "1.5.3.9.1". A compressed binary representation of ORDPATH provides document order by simple byte-by-byte comparison and ancestry relationship equally simply. In addition, the ORDPATH scheme supports insertion of new nodes at arbitrary positions in the XML tree, their ORDPATH values "careted in" between ORDPATHs of sibling nodes, without relabeling any old nodes. expand
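A much-simplified illustration of the labeling idea follows, using lists of integers in place of the compressed binary encoding: document order is plain component-wise comparison, ancestry is a prefix test, and a new node can be "careted in" between two siblings by using an even component (real-node ordinals are odd), so no existing label changes. This is a sketch of the concept, not the product's format.

```python
def is_ancestor(anc, desc):
    """Ancestry test: an ancestor's label is a proper prefix of the descendant's."""
    return len(anc) < len(desc) and desc[:len(anc)] == anc

def caret_in(left, right):
    """Return a label that sorts strictly between two sibling labels without relabeling
    either one. Assumes both labels differ only in their (odd) last component; in the
    real scheme the even 'caret' component does not add tree depth."""
    lo, hi = left[-1], right[-1]
    if hi - lo > 2:
        return left[:-1] + [lo + 2]   # an odd ordinal fits: ...3 / ...7 -> ...5
    return left[:-1] + [lo + 1, 1]    # no room: ...3 / ...5 -> ...4.1

a, b = [1, 5, 3], [1, 5, 5]
new = caret_in(a, b)
print(new)                            # [1, 5, 4, 1]
print(a < new < b)                    # True: document order is plain list comparison
print(is_ancestor([1, 5], new))       # True: the parent's label is untouched
```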
|
|
|
DEMONSTRATION SESSION: Web services |
|
|
|
|
Declarative specification of Web applications exploiting Web services and workflows |
| |
Marco Brambilla,
Stefano Ceri,
Sara Comai,
Marco Dario,
Piero Fraternali,
Ioana Manolescu
|
|
Pages: 909-910 |
|
doi>10.1145/1007568.1007688 |
|
Full text: PDF
|
|
This demo presents an extension of a declarative language for specifying data-intensive Web applications. We demonstrate a scenario extracted from a real-life application, the Web portal of a computer manufacturer, including interactions with third-party service providers and enabling distributors to participate in well-defined business processes. The crucial advantage of our framework is the high-level modeling of a complex Web application, extended with Web service and workflow capabilities. The application is automatically verified for correctness and the code is automatically generated and deployed. expand
|
|
|
Yoo-Hoo!: building a presence service with XQuery and WSDL |
| |
Mary Fernández,
Nicola Onose,
Jérôme Siméon
|
|
Pages: 911-912 |
|
doi>10.1145/1007568.1007689 |
|
Full text: PDF
|
|
|
|
|
DEMONSTRATION SESSION: Data integration |
|
|
|
|
Knocking the door to the deep Web: integrating Web query interfaces |
| |
Bin He,
Zhen Zhang,
Kevin Chen-Chuan Chang
|
|
Pages: 913-914 |
|
doi>10.1145/1007568.1007691 |
|
Full text: PDF
|
|
|
|
|
Efficient development of data migration transformations |
| |
Paulo Carreira,
Helena Galhardas
|
|
Pages: 915-916 |
|
doi>10.1145/1007568.1007692 |
|
Full text: PDF
|
|
In this paper, we present a data migration tool named DATA FUSION. Its main features are: a domain-specific language designed to conveniently model complex data transformations; an integrated development environment that assists users in managing complex data transformation projects; and an auditing facility that provides relevant information to project managers and external auditors.
|
|
|
Liquid data for WebLogic: integrating enterprise data and services |
| |
Vinayak Borkar
|
|
Pages: 917-918 |
|
doi>10.1145/1007568.1007693 |
|
Full text: PDF
|
|
Information in today's enterprises commonly resides in a variety of heterogeneous data sources, including relational databases, web services, files, packaged applications, and custom data repositories. BEA's enterprise information integration product, Liquid Data for WebLogic, takes an XML-based approach to providing integrated access to such heterogeneous information. This demonstration highlights the XML technologies involved - including web services, XQuery, and XML Schema - and shows how they can be brought to bear on the enterprise information integration problem. The demonstration uses a simple end-to-end example, one that involves integrating data from relational databases and web services, to walk the audience through the overall architecture, XML-based data modeling approach, programming model, declarative query and view facilities, and distributed processing features of Liquid Data. expand
|
|
|
DEMONSTRATION SESSION: Data mining |
|
|
|
|
MAIDS: mining alarming incidents from data streams |
| |
Y. Dora Cai,
David Clutter,
Greg Pape,
Jiawei Han,
Michael Welge,
Loretta Auvil
|
|
Pages: 919-920 |
|
doi>10.1145/1007568.1007695 |
|
Full text: PDF
|
|
|
|
|
FAÇADE: a fast and effective approach to the discovery of dense clusters in noisy spatial data |
| |
Yu Qian,
Gang Zhang,
Kang Zhang
|
|
Pages: 921-922 |
|
doi>10.1145/1007568.1007696 |
|
Full text: PDF
|
|
FAÇADE (Fast and Automatic Clustering Approach to Data Engineering) is a spatial clustering tool that can discover clusters of different sizes, shapes, and densities in noisy spatial data. Compared with the existing clustering methods, FAÇADE has several advantages: first, it separates true data and noise more effectively. Second, most steps of FAÇADE are automatic. Third, it requires only O(n log n) time. 2D and 3D visualizations are used in FAÇADE to assist parameter selection and result evaluation. More information on FAÇADE is available at: http://viscomp.utdallas.edu/FACADE.
|
|
|
DataMIME™ |
| |
Masum Serazi,
Vasily Malakhov,
Dongmei Ren,
Amal Perera,
Imad Rahal,
Weihua Wu,
Qiang Ding,
Fei Pan,
William Perrizo
|
|
Pages: 923-924 |
|
doi>10.1145/1007568.1007697 |
|
Full text: PDF
|
|
|
|
|
DEMONSTRATION SESSION: Streams |
|
|
|
|
PIPES: a public infrastructure for processing and exploring streams |
| |
Jürgen Krämer,
Bernhard Seeger
|
|
Pages: 925-926 |
|
doi>10.1145/1007568.1007699 |
|
Full text: PDF
|
|
PIPES is a flexible and extensible infrastructure providing fundamental building blocks to implement a data stream management system (DSMS). It is seamlessly integrated into the Java library XXL [1, 2, 3] for advanced query processing and extends XXL's scope towards continuous data-driven query processing over autonomous data sources. expand
|
|
|
Web-CAM: monitoring the dynamic Web to respond to continual queries |
| |
Shaveen Garg,
Krithi Ramamritham,
Soumen Chakrabarti
|
|
Pages: 927-928 |
|
doi>10.1145/1007568.1007700 |
|
Full text: PDF
|
|
|
|
|
Load management and high availability in the Medusa distributed stream processing system |
| |
Magdalena Balazinska,
Hari Balakrishnan,
Michael Stonebraker
|
|
Pages: 929-930 |
|
doi>10.1145/1007568.1007701 |
|
Full text: PDF
|
|
Medusa [3, 6] is a distributed stream processing system based on the Aurora single-site stream processing engine [1]. We demonstrate how Medusa handles time-varying load spikes and provides high availability in the face of network partitions. We demonstrate Medusa in the context of Borealis, a second generation stream processing engine based on Aurora and Medusa. expand
|
|
|
StreaMon: an adaptive engine for stream query processing |
| |
Shivnath Babu,
Jennifer Widom
|
|
Pages: 931-932 |
|
doi>10.1145/1007568.1007702 |
|
Full text: PDF
|
|
StreaMon is the adaptive query processing engine of the STREAM prototype Data Stream Management System (DSMS) [4]. A fundamental challenge in many DSMS applications (e.g., network monitoring, financial monitoring over stock tickers, sensor processing) is that conditions may vary significantly over time. Since queries in these systems are usually long-running, or continuous [4], it is important to consider adaptive approaches to query processing. Without adaptivity, performance may drop drastically as stream data and arrival characteristics, query loads, and system conditions change over time. StreaMon uses several techniques to support adaptive query processing [1, 2, 3]; we demonstrate three of them:
• Reducing run-time memory requirements for continuous queries by exploiting stream data and arrival patterns.
• Adaptive join ordering for pipelined multiway stream joins, with strong quality guarantees.
• Placing subresult caches adaptively in pipelined multiway stream joins to avoid recomputation of intermediate results.
|
|
|
DEMONSTRATION SESSION: Peer-to-peer and distributed databases |
|
|
|
|
P2P-DIET: an extensible P2P service that unifies ad-hoc and continuous querying in super-peer networks |
| |
Stratos Idreos,
Manolis Koubarakis,
Christos Tryfonopoulos
|
|
Pages: 933-934 |
|
doi>10.1145/1007568.1007704 |
|
Full text: PDF
|
|
|
|
|
Querying at Internet scale |
| |
Brent Chun,
Joseph M. Hellerstein,
Ryan Huebsch,
Shawn R. Jeffery,
Boon Thau Loo,
Sam Mardanbeigi,
Timothy Roscoe,
Sean Rhea,
Scott Shenker,
Ion Stoica
|
|
Pages: 935-936 |
|
doi>10.1145/1007568.1007705 |
|
Full text: PDF
|
|
We are developing a distributed query processor called PIER, which is designed to run on the scale of the entire Internet. PIER utilizes a Distributed Hash Table (DHT) as its communication substrate in order to achieve scalability, reliability, decentralized control, and load balancing. PIER enhances DHTs with declarative and algebraic query interfaces, and underneath those interfaces implements multihop, in-network versions of joins, aggregation, recursion, and query/result dissemination. PIER is currently being used for diverse applications, including network monitoring, keyword-based filesharing search, and network topology mapping. We will demonstrate PIER's functionality by showing system monitoring queries running on PlanetLab, a testbed of over 300 machines distributed across the globe. expand
|
|
|
Support for relaxed currency and consistency constraints in MTCache |
| |
Hongfei Guo,
Per-Åke Larson,
Raghu Ramakrishnan,
Jonathan Goldstein
|
|
Pages: 937-938 |
|
doi>10.1145/1007568.1007706 |
|
Full text: PDF
|
|
|
|
|
An indexing framework for peer-to-peer systems |
| |
Adina Crainiceanu,
Prakash Linga,
Ashwin Machanavajjhala,
Johannes Gehrke,
Jayavel Shanmugasundaram
|
|
Pages: 939-940 |
|
doi>10.1145/1007568.1007707 |
|
Full text: PDF
|
|
|
|
|
DEMONSTRATION SESSION: XML |
|
|
|
|
XSeq: an indexing infrastructure for tree pattern queries |
| |
Xiaofeng Meng,
Yu Jiang,
Yan Chen,
Haixun Wang
|
|
Pages: 941-942 |
|
doi>10.1145/1007568.1007709 |
|
Full text: PDF
|
|
Given a tree-pattern query, most XML indexing approaches decompose it into multiple sub-queries, and then join their results to provide the answer to the original query. Join operations have been identified as the most time-consuming component in XML query processing. XSeq is a powerful XML indexing infrastructure which makes tree patterns a first class citizen in XML query processing. Unlike most indexing methods that directly manipulate tree structures, XSeq builds its indexing infrastructure on a much simpler data model: sequences. That is, we represent both XML data and XML queries by structure-encoded sequences. We have shown that this new data representation preserves query equivalence, and more importantly, through subsequence matching, structured queries can be answered directly without resorting to expensive join operations. Moreover, the XSeq infrastructure unifies indices on both the content and the structure of XML documents, hence it achieves an additional performance advantage over methods indexing either just content or structure, or indexing them separately. expand
|
|
|
A TeXQuery-based XML full-text search engine |
| |
Chavdar Botev,
Sihem Amer-Yahia,
Jayavel Shanmugasundaram
|
|
Pages: 943-944 |
|
doi>10.1145/1007568.1007710 |
|
Full text: PDF
|
|
We demonstrate an XML full-text search engine that implements the TeXQuery language. TeXQuery is a powerful full-text search extension to XQuery that provides a rich set of fully composable full-text primitives, such as phrase matching, proximity distance, stemming and thesauri. TeXQuery enables users to seamlessly query over both structured data and text, by embedding full-text primitives in XQuery and vice versa. TeXQuery also supports a flexible scoring construct that scores query results based on full-text predicates and permits top-k queries. TeXQuery is the precursor of the full-text language extension to XPath 2.0 and XQuery 1.0 currently being developed by the W3C.
|
|
|
DEMONSTRATION SESSION: Data privacy |
|
|
|
|
"Share your data, keep your secrets." |
| |
Irini Fundulaki,
Arnaud Sahuguet
|
|
Pages: 945-946 |
|
doi>10.1145/1007568.1007712 |
|
Full text: PDF
|
|
|
|
|
Managing healthcare data hippocratically |
| |
Rakesh Agrawal,
Ameet Kini,
Kristen LeFevre,
Amy Wang,
Yirong Xu,
Diana Zhou
|
|
Pages: 947-948 |
|
doi>10.1145/1007568.1007713 |
|
Full text: PDF
|
|
|
|
|
DEMONSTRATION SESSION: Potpourri |
|
|
|
|
LexEQUAL: multilexical matching operator in SQL |
| |
A. Kumaran,
Jayant R. Haritsa
|
|
Pages: 949-950 |
|
doi>10.1145/1007568.1007715 |
|
Full text: PDF
|
|
|
|
|
ITQS: an integrated transport query system |
| |
B. Huang,
Z. Huang,
H. Li,
D. Lin,
H. Lu,
Y. Song
|
|
Pages: 951-952 |
|
doi>10.1145/1007568.1007716 |
|
Full text: PDF
|
|
|
|
|
BODHI: a database habitat for bio-diversity information |
| |
Srikanta J. Bedathur,
Abhijit Kadlag,
Jayant R. Haritsa
|
|
Pages: 953-954 |
|
doi>10.1145/1007568.1007717 |
|
Full text: PDF
|
|
|
|
|
CAMAS: a citizen awareness system for crisis mitigation |
| |
Sharad Mehrotra,
Carter Butts,
Dmitri V. Kalashnikov,
Nalini Venkatasubramanian,
Kemal Altintas,
Ram Hariharan,
Haimin Lee,
Yiming Ma,
Amnon Myers,
Jehan Wickramasuriya,
Ron Eguchi,
Charles Huyck
|
|
Pages: 955-956 |
|
doi>10.1145/1007568.1007718 |
|
Full text: PDF
|
|
|
|
|
PANEL SESSION: Panel |
| |
Christian S. Jensen
|
|
|
|
|
Rethinking the conference reviewing process |
| |
Michael J. Franklin,
Jennifer Widom,
Anastassia Ailamaki,
Philip A. Bernstein,
David DeWitt,
Alon Halevy,
Zachary Ives,
Gerhard Weikum
|
|
Pages: 957-957 |
|
doi>10.1145/1007568.1007720 |
|
Full text: PDF
|
|
|
|
|
TUTORIAL SESSION: Tutorial 1 |
|
|
|
|
Tools for design of composite Web services |
| |
Richard Hull,
Jianwen Su
|
|
Pages: 958-961 |
|
doi>10.1145/1007568.1007722 |
|
Full text: PDF
|
|
|
|
|
TUTORIAL SESSION: Tutorial 2 |
|
|
|
|
Security of shared data in large systems: state of the art and research directions |
| |
Arnon Rosenthal,
Marianne Winslett
|
|
Pages: 962-964 |
|
doi>10.1145/1007568.1007724 |
|
Full text: PDF
|
|
The target audience for this tutorial is the entire SIGMOD research community. The goals of the tutorial are to enlighten the SIGMOD research community about the state of the art in data security, especially for enterprise or larger systems, and to engage the community's interest in improving the state of the art. expand
|
|
|
TUTORIAL SESSION: Tutorial 3 |
|
|
|
|
Fast algorithms for time series with applications to finance, physics, music, biology, and other suspects |
| |
Alberto Lerner,
Dennis Shasha,
Zhihua Wang,
Xiaojian Zhao,
Yunyue Zhu
|
|
Pages: 965-968 |
|
doi>10.1145/1007568.1007726 |
|
Full text: PDF
|
|
Financial time series streams are watched closely by millions of traders. What exactly do they look for and how can we help them do it faster? Physicists study the time series emerging from their sensors. The same question holds for them. Musicians produce time series. Consumers may want to compare them. This tutorial presents techniques and case studies for four problems:
1. Finding sliding window correlations in financial, physical, and other applications.
2. Discovering bursts in large sensor data of gamma rays.
3. Matching hums to recorded music, even when people don't hum well.
4. Maintaining and manipulating time-ordered data in a database setting.
This tutorial draws mostly from the book High Performance Discovery in Time Series: Techniques and Case Studies, Springer-Verlag 2004. You can find the PowerPoint slides for this tutorial at http://cs.nyu.edu/cs/faculty/shasha/papers/sigmod04.ppt. The tutorial is aimed at researchers in streams, data mining, and scientific computing. Its applications should interest anyone who works with scientists or financial "quants." The emphasis will be on recent results and open problems. This is a ripe area for further advance.
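To fix the definition behind the first problem on the list, here is a naive Python sketch of sliding-window correlation between two series; the brute-force recomputation shown is precisely what the tutorial's techniques are designed to beat, and the series are synthetic.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def sliding_correlations(s1, s2, window):
    """Correlation over every window of the given length (naive O(n * window) rescan)."""
    return [pearson(s1[i:i + window], s2[i:i + window])
            for i in range(len(s1) - window + 1)]

# Two synthetic "price" series that move together with some noise.
s1 = [10, 11, 12, 11, 13, 14, 13, 15, 16, 15]
s2 = [20, 21, 23, 22, 25, 27, 26, 29, 31, 30]
print([round(c, 2) for c in sliding_correlations(s1, s2, window=5)])
```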
|
|
|
TUTORIAL SESSION: Tutorial 4 |
|
|
|
|
Indexing and mining streams |
| |
Christos Faloutsos
|
|
Pages: 969-969 |
|
doi>10.1145/1007568.1007728 |
|
Full text: PDF
|
|
|