An Introduction to Variational Methods for Graphical Models

Published: 01 November 1999

Abstract

This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks and Markov random fields). We present a number of examples of graphical models, including the QMR-DT database, the sigmoid belief network, the Boltzmann machine, and several variants of hidden Markov models, in which it is infeasible to run exact inference algorithms. We then introduce variational methods, which exploit laws of large numbers to transform the original graphical model into a simplified graphical model in which inference is efficient. Inference in the simplified model provides bounds on probabilities of interest in the original model. We describe a general framework for generating variational transformations based on convex duality. Finally, we return to the examples and demonstrate how variational algorithms can be formulated in each case.
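To make the convex-duality idea in the abstract concrete, the sketch below illustrates one of the simplest variational transformations of this kind: the conjugate-dual upper bound on the logarithm, ln(x) ≤ λx − ln(λ) − 1 for any λ > 0, which is tight at λ = 1/x. This is a minimal numerical illustration of the general technique, not code from the paper; the function names are illustrative.

```python
import numpy as np

def log_upper_bound(x, lam):
    """Variational upper bound on ln(x) obtained from convex duality:
    ln(x) <= lam * x - ln(lam) - 1 for any lam > 0,
    with equality when the variational parameter lam = 1/x."""
    return lam * x - np.log(lam) - 1.0

x = 2.5
for lam in [0.1, 0.4, 1.0 / x, 1.0]:
    bound = log_upper_bound(x, lam)
    print(f"lam={lam:.3f}  bound={bound:.4f}  ln(x)={np.log(x):.4f}")
# The bound meets ln(x) exactly at lam = 1/x and lies above it elsewhere;
# optimizing over lam recovers the original (intractable) quantity, which is
# the pattern variational inference exploits at the level of whole models.
```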
