An Introduction to Variational Methods for Graphical Models.

Introduction (1)
The problem of probabilistic inference:
- H: set of hidden nodes
- E: set of evidence nodes
- compute P(H | E) = P(H, E) / P(E), where P(E) is the likelihood of the evidence
Goal: provide satisfactory solutions to inference and learning in cases where the time or space complexity of exact computation is unacceptable.

Introduction (2)
Variational methods
- provide an approach to the design of approximate inference algorithms
- are deterministic approximation procedures
- intuition: "complex graphs can be probabilistically simple."

Exact Inference (1)
Overview of exact inference for graphical models: the junction tree algorithm.
Directed graphical model (Bayesian network)
- joint probability distribution over all N nodes: P(S) = P(S_1, S_2, …, S_N)
- the joint factorizes as a product of local conditional probabilities, P(S) = prod_i P(S_i | S_pa(i)), where pa(i) denotes the parents of node i (illustrated below)
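
As a concrete illustration of the directed factorization, here is a minimal sketch; the three-node chain and all of its conditional probability tables are hypothetical numbers, not taken from the slides:

```python
# Joint probability of a tiny Bayesian network A -> B -> C,
# computed as the product of local conditional probabilities.
# All tables are made-up illustrative numbers.

p_a = {0: 0.7, 1: 0.3}                       # P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1},          # P(B | A)
               1: {0: 0.4, 1: 0.6}}
p_c_given_b = {0: {0: 0.8, 1: 0.2},          # P(C | B)
               1: {0: 0.3, 1: 0.7}}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b|a) * P(c|b)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# Sanity check: the joint sums to one over all configurations.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # 1.0 (up to floating-point error)
```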

Exact Inference (2)
Undirected graphical model
- potential: a function on the set of configurations of a clique that associates a positive real number with each configuration
- joint probability distribution: the normalized product of the clique potentials, P(S) = (1/Z) prod_C psi_C(S_C), where Z sums this product over all configurations (see the sketch below)
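
A minimal numeric sketch of this normalization; the two-clique chain and its potential tables are invented for illustration:

```python
import itertools

# Undirected chain A - B - C with pairwise clique potentials.
# Potentials need only be positive; the values are arbitrary.
psi_ab = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
psi_bc = {(0, 0): 1.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.5}

def unnormalized(a, b, c):
    return psi_ab[(a, b)] * psi_bc[(b, c)]

# Partition function Z: sum over all joint configurations.
Z = sum(unnormalized(*s) for s in itertools.product((0, 1), repeat=3))

def joint(a, b, c):
    """P(a, b, c) = psi_AB(a, b) * psi_BC(b, c) / Z."""
    return unnormalized(a, b, c) / Z

print(sum(joint(*s) for s in itertools.product((0, 1), repeat=3)))  # 1.0
```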

Exact Inference (3)
Junction tree algorithm
- Moralization step (see the sketch below)
  - compiles the directed graphical model into an undirected graphical model: the parents of each node are linked ("married") and edge directions are dropped
- Triangulation step
  - input: the moral graph
  - output: an undirected graph in which additional edges have been added so that recursive calculation of probabilities can take place
  - from the triangulated graph one builds the data structure known as a junction tree
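
A minimal sketch of the moralization step, assuming the directed graph is given as a parent map; the example v-structure is invented:

```python
from itertools import combinations

def moralize(parents):
    """Moralize a DAG given as {node: set of parents}.

    Returns an undirected graph as {node: set of neighbours}:
    every node is linked to its parents, and all parents of a
    common child are linked to each other ("married").
    """
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    undirected = {n: set() for n in nodes}
    for child, ps in parents.items():
        for p in ps:                      # drop directions
            undirected[child].add(p)
            undirected[p].add(child)
        for u, v in combinations(ps, 2):  # marry co-parents
            undirected[u].add(v)
            undirected[v].add(u)
    return undirected

# Hypothetical v-structure A -> C <- B: moralization adds the edge A - B.
print(moralize({"C": {"A", "B"}}))
```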

Junction Tree
Running intersection property
- "If a node appears in any two cliques in the tree, it appears in all cliques that lie on the path between the two cliques."
- this guarantees that local consistency implies global consistency
Time complexity of the probabilistic calculation
- depends on the size of the cliques
- for discrete nodes, the number of values needed to represent a clique potential is exponential in the number of nodes in the clique

The QMR-DT database (1)
- a large-scale probabilistic database intended to be used as a diagnostic aid
- bipartite graphical model
  - upper layer: diseases
  - lower layer: symptoms (findings)
  - approximately 600 disease nodes and 4000 finding nodes

The QMR-DT database (2)
- findings: the observed symptoms
- f: the vector of findings
- d: the vector of diseases
- all nodes are binary

The QMR-DT database (3)
The joint probability over diseases and findings factorizes as P(f, d) = P(f | d) P(d) = [prod_i P(f_i | d)] [prod_j P(d_j)].
- Prior probabilities of the diseases are obtained from archival data.
- Conditional probabilities were obtained from expert assessments under a "noisy-OR" model: P(f_i = 0 | d) = (1 - q_i0) prod_j (1 - q_ij)^{d_j}.
- The q_ij are parameters obtained from the expert assessments.

The QMR-DT database (4)
Rewrite the noisy-OR model in exponential form, P(f_i = 0 | d) = exp(-θ_i0 - Σ_j θ_ij d_j) with θ_ij = -ln(1 - q_ij), so that the joint probability distribution is a product of such factors (a numeric sketch follows below).
- Negative findings are benign with respect to the inference problem: they keep the joint in product-of-exponentials form.
- Positive findings contribute factors of the form 1 - exp(-θ_i0 - Σ_j θ_ij d_j), whose cross-product terms couple the diseases.
- These coupling terms lead to exponential growth in inferential complexity.
- Diagnostic calculation under the QMR-DT model is therefore generally infeasible.
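
A minimal numeric sketch of the noisy-OR conditional and its exponential rewriting; the disease vector and the q values below are invented:

```python
import math

def p_finding_absent(d, q0, q):
    """Noisy-OR: P(f_i = 0 | d) = (1 - q0) * prod_j (1 - q[j]) ** d[j].

    d  : binary disease vector
    q0 : leak probability, q : per-disease activation probabilities
    (all numbers used below are made up for illustration).
    """
    p = 1.0 - q0
    for dj, qj in zip(d, q):
        p *= (1.0 - qj) ** dj
    return p

d = [1, 0, 1]                 # diseases 0 and 2 present
q0, q = 0.05, [0.3, 0.8, 0.6]

p0 = p_finding_absent(d, q0, q)
print(p0, 1.0 - p0)           # P(f_i = 0 | d) and P(f_i = 1 | d)

# Same quantity via the exponential ("theta") form used in the text.
theta0 = -math.log(1.0 - q0)
theta = [-math.log(1.0 - qj) for qj in q]
print(math.exp(-theta0 - sum(t * dj for t, dj in zip(theta, d))))  # matches p0
```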

Neural networks as graphical models
Neural networks: layered graphs endowed with a nonlinear "activation" function at each node.
- the activation function is bounded between zero and one, e.g. the logistic function f(z) = 1 / (1 + e^{-z})
Treat the neural network as a graphical model by
- associating a binary variable S_i with each node
- interpreting the activation of the node as the probability that the associated binary variable takes one of its two values

Neural networks (2)
Example (sigmoid belief network): P(S_i = 1 | S_pa(i)) = f(Σ_j θ_ij S_j + θ_i0), with f the logistic function (see the sketch below).
- θ_ij: parameter associated with the edge between parent node j and node i
- θ_i0: bias parameter
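
A minimal sketch of this conditional probability; the weights, bias, and parent states below are invented for illustration:

```python
import math

def sigmoid(z):
    """Logistic activation function f(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def p_node_on(parent_states, weights, bias):
    """Sigmoid belief network: P(S_i = 1 | parents) = f(sum_j theta_ij * S_j + theta_i0)."""
    z = bias + sum(w * s for w, s in zip(weights, parent_states))
    return sigmoid(z)

# Hypothetical node with three binary parents.
parents = [1, 0, 1]
weights = [1.2, -0.7, 0.5]
bias = -0.3

print(p_node_on(parents, weights, bias))  # probability that S_i = 1
```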

Neural networks (3)
Exact inference is infeasible in general layered neural network models:
- a node has as parents all of the nodes in the preceding layer
- thus the moralized neural network graph has links between all of the nodes in that layer
- hidden units in the penultimate layer become probabilistically dependent, as do their ancestors in the preceding hidden layers, yielding cliques that are too large for exact inference

Factorial hidden Markov models (1)
FHMM
- composed of a set of M Markov chains evolving in parallel
- S_i^(m): state node for the mth chain at time i
- A^(m): transition matrix for the mth chain

FHMM (2)
Overall transition probability: the chains are a priori independent, so P(S_{i+1} | S_i) = prod_m A^(m)(S_{i+1}^(m) | S_i^(m)) (see the sketch below).
- the effective state space for the FHMM is the Cartesian product of the state spaces associated with the individual chains
- this represents a large effective state space with a much smaller number of parameters
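
A minimal sketch of the factored transition probability; the two hypothetical chains and their transition matrices are invented:

```python
# Factored transition probability of an FHMM with M independent chains:
# P(S_{t+1} | S_t) = prod_m A_m[current state of chain m][next state of chain m].
# The transition matrices below are made-up illustrative numbers.

A = [
    [[0.9, 0.1],          # chain 0: 2 states
     [0.2, 0.8]],
    [[0.6, 0.3, 0.1],     # chain 1: 3 states
     [0.1, 0.8, 0.1],
     [0.2, 0.2, 0.6]],
]

def transition_prob(state, next_state):
    """Product over chains of the per-chain transition probabilities."""
    p = 1.0
    for m, A_m in enumerate(A):
        p *= A_m[state[m]][next_state[m]]
    return p

print(transition_prob((0, 2), (1, 2)))  # 0.1 * 0.6 = 0.06
# The effective state space size is the product of the chain sizes (2 * 3 = 6),
# while the number of parameters grows only with the sum of the A^(m) sizes.
```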

FHMM (3)
Emission probabilities of the FHMM (Ghahramani and Jordan): the output at time i depends on the states of all of the chains, e.g. through a mean formed from contributions B^(m) S_i^(m).
- B^(m) and Σ: matrices of parameters
- the states become stochastically coupled when the outputs are observed

FHMM (4)
Time complexity (for the case of M = 3 chains)
- N: the number of states in each chain
- the cliques for the hidden states appear to be of size N^3, suggesting a time complexity of exact inference of O(N^3 T)
- however, triangulation creates cliques of size N^4, so the complexity of exact inference is O(N^4 T)
- in general, with M chains, the complexity is O(N^{M+1} T)

Hidden Markov decision trees
HMDT (hidden Markov decision tree)
- decisions in the decision tree are conditioned not only on the current data point, but also on the decisions at the previous moment in time
- the dependency is assumed to be level-specific: the probability of a decision depends only on the previous decision at the same level of the decision tree
- problem: given a sequence of input vectors U_i and a sequence of output vectors Y_i, compute the conditional probability distribution over the hidden states
- this calculation is intractable, as it is for the FHMM

Basics of variational methodology
Variational methods
- convert a complex problem into a simpler problem
- the simpler problem is generally characterized by a decoupling of the degrees of freedom in the original problem
- the decoupling is achieved via an expansion of the problem to include additional parameters (variational parameters) that must be fit to the problem at hand

Examples (1)
Express the logarithm function variationally: ln(x) = min_λ {λx - ln λ - 1}.
- λ: variational parameter

Examples (2)
For any given x, ln(x) ≤ λx - ln λ - 1 for all λ > 0 (a numeric check follows below).
- the variational transformation provides a family of upper bounds on the logarithm
- the minimum over these bounds is the exact value of the logarithm
Pragmatic justification
- a nonlinear function is replaced by a function that is linear in x
- cost: a free parameter λ must be set for each x
- if we set λ well, we can obtain a good bound (λ = 1/x recovers ln(x) exactly)
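
A minimal numeric check of this variational bound; the sample point x and the λ values are arbitrary:

```python
import math

def log_upper_bound(x, lam):
    """Variational upper bound on the logarithm: ln(x) <= lam * x - ln(lam) - 1."""
    return lam * x - math.log(lam) - 1.0

x = 3.0
for lam in (0.1, 1.0 / x, 1.0, 2.0):   # 1/x is the optimal setting
    bound = log_upper_bound(x, lam)
    print(f"lambda={lam:.3f}  bound={bound:.4f}  ln(x)={math.log(x):.4f}")
# Every bound is >= ln(x); the bound with lambda = 1/x is exact.
```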

Example (3)
For binary-valued nodes, it is common to represent the probability that the node takes one of its values via a monotonic nonlinearity.
- example: the logistic regression model, P(S = 1 | x) = f(x) = 1 / (1 + e^{-x})
- x: the weighted sum of the values of the parents of the node
- f(x) is neither convex nor concave, so a simple linear bound will not work
- however, f(x) is log concave

Example (4)
Bound the log logistic function with linear functions, ln f(x) ≤ λx - H(λ), and hence bound the logistic function by an exponential, f(x) ≤ exp(λx - H(λ)) (a numeric check follows below).
- H(λ): the binary entropy function, H(λ) = -λ ln λ - (1 - λ) ln(1 - λ)
- a good choice of λ provides a tighter bound
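
A minimal numeric check of this exponential bound on the logistic function; the sample x and the λ values are arbitrary, and equality holds at λ = f(-x):

```python
import math

def logistic(x):
    """Logistic function f(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_entropy(lam):
    """H(lam) = -lam * ln(lam) - (1 - lam) * ln(1 - lam)."""
    return -lam * math.log(lam) - (1.0 - lam) * math.log(1.0 - lam)

def logistic_upper_bound(x, lam):
    """Variational bound: f(x) <= exp(lam * x - H(lam)) for 0 < lam < 1."""
    return math.exp(lam * x - binary_entropy(lam))

x = 1.5
for lam in (0.1, 0.5, logistic(-x), 0.9):   # logistic(-x) is the optimal setting
    print(f"lambda={lam:.3f}  bound={logistic_upper_bound(x, lam):.4f}  f(x)={logistic(x):.4f}")
# Every bound is >= f(x); the bound with lambda = f(-x) is exact.
```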

Example (5)
Significance of the transformation
- for conditional probabilities represented with logistic regression, the joint probability is a product of functions of the form f(x) = 1 / (1 + e^{-x})
- by augmenting the network representation with variational parameters, a bound on the joint probability is obtained by taking products of exponentials, which are much easier to manipulate

Convex duality (1)
General fact of convex analysis: a concave function f(x) can be represented via a conjugate or dual function f*(λ):
f(x) = min_λ {λ^T x - f*(λ)}, where the conjugate function is f*(λ) = min_x {λ^T x - f(x)}.

Convex duality (2)
Geometric picture:
- consider f(x) and the linear function λx for a particular λ
- shift λx vertically by an amount equal to the minimum over x of λx - f(x), i.e. by f*(λ)
- this yields an upper-bounding line with slope λ that touches f(x) at a single point

Convex duality (3)
- the framework of convex duality applies equally well to lower bounds: a convex function can be lower-bounded by lines, with a max in place of the min in the dual representation (a numeric illustration of the dual representation follows below)
- convex duality is not restricted to linear bounds
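
A minimal numeric sketch of the dual representation, using SciPy to compute the conjugate of the (concave) logarithm and then recover the logarithm by minimizing the bound; the grid of λ values and the optimization bounds are arbitrary choices:

```python
import math
import numpy as np
from scipy.optimize import minimize_scalar

def f(x):
    """Concave function to be represented variationally."""
    return math.log(x)

def conjugate(lam):
    """f*(lam) = min_x { lam * x - f(x) }, computed numerically."""
    res = minimize_scalar(lambda x: lam * x - f(x), bounds=(1e-6, 1e3), method="bounded")
    return res.fun

x = 3.0
# For each lam, lam * x - f*(lam) is an upper bound on f(x);
# minimizing over lam recovers f(x) itself.
bounds = [lam * x - conjugate(lam) for lam in np.linspace(0.05, 2.0, 200)]
print(min(bounds), math.log(x))   # approximately equal (ln 3 ≈ 1.0986)
```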

Approximation for joint probabilities and conditional probabilities
Directed graphs
- suppose we have a lower bound and an upper bound for each of the local conditional probabilities P(S_i | S_pa(i))
- that is, we have parameterized forms P^U(S_i | S_pa(i), λ_i^U) ≥ P(S_i | S_pa(i)) and P^L(S_i | S_pa(i), λ_i^L) ≤ P(S_i | S_pa(i)), where the λ_i are variational parameters
- let E and H be a disjoint partition of S into evidence and hidden nodes

Approximation (2)
- given that the upper bound holds for any setting of the variational parameters, it holds in particular for the optimizing settings of the parameters
- the right-hand side of the bound is therefore a function to be minimized with respect to the variational parameters
Distinction between joint and marginal probabilities
- joint probabilities: if we allow the variational parameters to be set optimally for each value of the argument S, then it is possible to find optimizing settings of the variational parameters that recover the exact value of the joint probability
- marginal probabilities: a single setting of the variational parameters must serve every configuration in the sum over H, so we are generally not able to recover the exact value