ABSTRACT
Many applications today need to manage large data sets with uncertainties. In this paper we describe the foundations of managing data where the uncertainties are quantified as probabilities. We review the basic definitions of the probabilistic data model, present some fundamental theoretical results for query evaluation on probabilistic databases, and discuss several challenges, open problems, and research directions.
Index Terms
Management of probabilistic data: foundations and challenges