ABSTRACT
Conceptually, the common approach to manipulating probabilistic data is to evaluate relational queries and then calculate the probability of each tuple in the result. This approach ignores the possibility that the probabilities of complete answers are too low and, hence, partial answers (with sufficiently high probabilities) become important. Therefore, we consider the semantics in which answers are maximal (i.e., have the smallest degree of incompleteness), subject tothe constraint that the probability is still above a given threshold.
We investigate the complexity of joining relations under the above semantics. In contrast to the deterministic case, this approach gives rise to two different enumeration problems. The first is finding all maximal sets of tuples that are join consistent, connected and have a joint probability above the threshold. The second is computing all maximal tuples that are answers of partial joins and have a probability above the threshold. Both problems are tractable under data complexity. We also consider query-and-data complexity, which rules out as efficient the following naive algorithm: compute all partial answers and then choose the maximal ones among those with probabilities above the threshold. We give efficient algorithms for several, important special cases. We also show that, in general, the first problem is NP-hard whereas the secondis #P-hard.
- D. Barbará, H. Garcia-Molina, and D. Porter. The management of probabilistic data. IEEE Trans. Knowl. Data Eng., 4(5), 1992. Google Scholar
Digital Library
- C. Beeri, R. Fagin, D. Maier, and M. Yannakakis. On the desirability of acyclic database schemes. J. ACM, 30(3), 1983. Google Scholar
Digital Library
- P. A. Bernstein and N. Goodman. Power of natural semijoins. SIAM J. Comput., 10(4), 1981.Google Scholar
- R. Cavallo and M. Pittarelli. The theory of probabilistic databases. In P. M. Stocker, W. Kent, and P. Hammersley, editors, VLDB, 1987. Google Scholar
Digital Library
- S. Cohen, I. Fadida, Y. Kanza, B. Kimelfeld, and Y. Sagiv. Full disjunctions: Polynomial-delay iterators in action. In VLDB, 2006. Google Scholar
Digital Library
- S. Cohen and Y. Sagiv. An abstract framework for generating maximal answers to queries. In ICDT, 2005. Google Scholar
Digital Library
- S. Cohen and Y. Sagiv. An incremental algorithm for computing ranked full disjunctions. In PODS, 2005. Google Scholar
Digital Library
- S. Cohen and Y. Sagiv. An incremental algorithm for computing ranked full disjunctions. To appear in J. Comput. Syst. Sci. (an extended version of {7}), 2006. Google Scholar
Digital Library
- N. N. Dalvi, G. Miklau, and D. Suciu. Asymptotic conditional probabilities for conjunctive queries. In ICDT, 2005. Google Scholar
Digital Library
- N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.Google Scholar
Digital Library
- D. Dey and S. Sarkar. A probabilistic relational model and algebra. ACM Trans. Database Syst., 21(3), 1996. Google Scholar
Digital Library
- R. Fagin. Degrees of acyclicity for hypergraphs and relational database schemes. J. ACM, 30(3), 1983. Google Scholar
Digital Library
- N. Fuhr and T. Rölleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst., 15(1), 1997. Google Scholar
Digital Library
- C. A. Galindo-Legaria. Outerjoins as disjunctions. In SIGMOD, 1994. Google Scholar
Digital Library
- D. Johnson, M. Yannakakis, and C. Papadimitriou. On generating all maximal independent sets. Information Processing Letters, 27, 1988. Google Scholar
Digital Library
- Y. Kanza, W. Nutt, and Y. Sagiv. Querying incomplete information in semistructured data. J. Comput. Syst. Sci., 64(3), 2002.Google Scholar
- Y. Kanza and Y. Sagiv. Computing full disjunctions. In PODS, 2003. Google Scholar
Digital Library
- L. V. S. Lakshmanan, N. Leone, R. Ross, and V. S. Subrahmanian. ProbView: A flexible probabilistic database system. ACM Trans. Database Syst., 22(3), 1997. Google Scholar
Digital Library
- Y. E. Lien. On the equivalence of database models. J. ACM, 29(2), 1982. Google Scholar
Digital Library
- A. Nierman and H. V. Jagadish. ProTDB: Probabilistic data in XML. In VLDB, 2002. Google Scholar
Digital Library
- J. S. Provan and M. O. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM J. Comput., 12(4), 1983.Google Scholar
Cross Ref
- A. Rajaraman and J. D. Ullman. Integrating information by outerjoins and full disjunctions. In PODS, 1996. Google Scholar
Digital Library
- S. Toda and M. Ogiwara. Counting classes are at least as hard as the polynomial-time hierarchy. SIAM J. Comput., 21(2), 1992. Google Scholar
Digital Library
- L. G. Valiant. The complexity of computing the permanent. Theor. Comput. Sci., 8, 1979.Google Scholar
- M. Yannakakis. Algorithms for acyclic database schemes. In VLDB, 1981.Google Scholar
Digital Library
Index Terms
Maximally joining probabilistic data
Recommendations
Computing the top-k maximal answers in a join of ranked lists
CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge managementComplex search tasks that utilize information from several data sources, are answered by integrating the results of distinct basic search queries. In such integration, each basic query returns a ranked list of items, and the main task is to compute the ...
Sensitivity analysis and explanations for robust query evaluation in probabilistic databases
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of dataProbabilistic database systems have successfully established themselves as a tool for managing uncertain data. However, much of the research in this area has focused on efficient query evaluation and has largely ignored two key issues that commonly ...
Consensus answers for queries over probabilistic databases
PODS '09: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsWe address the problem of finding a "best" deterministic query answer to a query over a probabilistic database. For this purpose, we propose the notion of a consensus world (or a consensus answer) which is a deterministic world (answer) that minimizes ...






Comments