DOI: 10.1145/1265530.1265531

Management of probabilistic data: foundations and challenges

Published: 11 June 2007

ABSTRACT

Many applications today need to manage large data sets with uncertainties. In this paper we describe the foundations of managing data where the uncertainties are quantified as probabilities. We review the basic definitions of the probabilistic data model, present some fundamental theoretical results for query evaluation on probabilistic databases, and discuss several challenges, open problems, and research directions.

Published in: PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2007, 328 pages
ISBN: 9781595936851
DOI: 10.1145/1265530
Copyright © 2007 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 476 of 1,835 submissions (26%)