Introduction to Graphical Models Brookes Vision Lab Reading Group.


Graphical Models
- Build a complex system out of simpler parts
- The system should be consistent
- Parts are combined using probability
- Undirected: Markov random fields
- Directed: Bayesian networks

Overview
- Representation
- Inference
- Linear Gaussian models
- Approximate inference
- Learning

Representation
- Causality: sprinkler causes wet grass

Conditional Independence
- A node is independent of its ancestors given its parents
- P(C,S,R,W) = P(C) P(S|C) P(R|C,S) P(W|C,S,R) = P(C) P(S|C) P(R|C) P(W|S,R)
- Space required for n binary nodes: O(2^n) without factorization, O(n 2^k) with factorization, where k is the maximum fan-in

Inference
- Pr(S=1 | W=1) = Pr(S=1, W=1) / Pr(W=1) = 0.2781 / 0.6471 = 0.430
- Pr(R=1 | W=1) = Pr(R=1, W=1) / Pr(W=1) = 0.4581 / 0.6471 = 0.708

Explaining Away
- S and R compete to explain W=1
- Given W=1, S and R become conditionally dependent
- Pr(S=1 | R=1, W=1) = 0.1945, lower than Pr(S=1 | W=1) = 0.430
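These numbers can be reproduced by brute-force enumeration of the joint. The conditional probability tables below are not given on the slides; they are assumed from the standard sprinkler example (Murphy), and they yield exactly the quoted values:

```python
import itertools

# CPTs assumed from the standard sprinkler example (the slides
# quote the resulting numbers but omit the tables).
p_c = 0.5                                   # P(C=1)
p_s = {0: 0.5, 1: 0.1}                      # P(S=1 | C)
p_r = {0: 0.2, 1: 0.8}                      # P(R=1 | C)
p_w = {(0, 0): 0.0, (0, 1): 0.9,            # P(W=1 | S, R)
       (1, 0): 0.9, (1, 1): 0.99}

def joint(c, s, r, w):
    """P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)."""
    pc = p_c if c else 1 - p_c
    ps = p_s[c] if s else 1 - p_s[c]
    pr = p_r[c] if r else 1 - p_r[c]
    pw = p_w[(s, r)] if w else 1 - p_w[(s, r)]
    return pc * ps * pr * pw

def marginal(**fixed):
    """Sum the joint over all assignments consistent with `fixed`."""
    return sum(joint(c, s, r, w)
               for c, s, r, w in itertools.product((0, 1), repeat=4)
               if all(dict(C=c, S=s, R=r, W=w)[k] == v for k, v in fixed.items()))

print(f"{marginal(S=1, W=1) / marginal(W=1):.3f}")            # 0.430
print(f"{marginal(R=1, W=1) / marginal(W=1):.3f}")            # 0.708
print(f"{marginal(S=1, R=1, W=1) / marginal(R=1, W=1):.4f}")  # 0.1945 (explaining away)
```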

Inference
- Variable elimination
- Choosing the optimal elimination ordering is NP-hard; greedy methods work well
- When computing several marginals, dynamic programming avoids redundant computation
- Sound familiar?
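A minimal sketch of variable elimination on the sprinkler network (CPTs again assumed from the standard example): summing out C first produces an intermediate factor tau(S, R) that is reused when S and R are summed out, which is exactly the dynamic-programming saving:

```python
# Assumed sprinkler CPTs (standard example values)
p_c = 0.5                                  # P(C=1)
p_s = {0: 0.5, 1: 0.1}                     # P(S=1 | C)
p_r = {0: 0.2, 1: 0.8}                     # P(R=1 | C)
p_w = {(0, 0): 0.0, (0, 1): 0.9,           # P(W=1 | S, R)
       (1, 0): 0.9, (1, 1): 0.99}

def bern(p, v):
    """P(value v) for a binary variable with P(1) = p."""
    return p if v else 1 - p

# Eliminate C first: tau(s, r) = sum_c P(c) P(s|c) P(r|c)
tau = {(s, r): sum(bern(p_c, c) * bern(p_s[c], s) * bern(p_r[c], r) for c in (0, 1))
       for s in (0, 1) for r in (0, 1)}

# P(W=1) = sum_{s,r} tau(s, r) P(W=1 | s, r); the sum over c is never repeated
p_w1 = sum(tau[(s, r)] * p_w[(s, r)] for s in (0, 1) for r in (0, 1))
print(f"{p_w1:.4f}")  # 0.6471, the normalizer used on the inference slide
```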

Bayes Balls for Conditional Independence

A Unifying (Re)View
- Basic model: the Linear Gaussian Model (LGM)
- Continuous-state LGM: FA, SPCA, PCA, LDS
- Discrete-state LGM: Mixture of Gaussians, VQ, HMM

Basic Model
- The state of the system is a k-vector x (unobserved); the output is a p-vector y (observed); often k << p
- Basic model:
  x_{t+1} = A x_t + w,  w ~ N(0, Q)
  y_t = C x_t + v,      v ~ N(0, R)
- A is the k x k transition matrix; C is the p x k observation matrix
- The noise processes are essential; zero mean w.l.o.g.
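The generative recursion above can be sketched as a simulation; dimensions and the (stable) transition matrix below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
k, p, T = 2, 4, 50                 # state dim, output dim, sequence length (illustrative)
A = 0.9 * np.eye(k)                # k x k transition matrix, chosen stable for the demo
C = rng.standard_normal((p, k))    # p x k observation matrix
Q, R = np.eye(k), 0.1 * np.eye(p)  # state / observation noise covariances

x = np.zeros((T, k))
y = np.zeros((T, p))
x[0] = rng.multivariate_normal(np.zeros(k), Q)  # x_1 ~ N(mu_1, Q_1); here mu_1 = 0, Q_1 = Q
for t in range(T):
    # y_t = C x_t + v,  v ~ N(0, R)
    y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(p), R)
    if t + 1 < T:
        # x_{t+1} = A x_t + w,  w ~ N(0, Q)
        x[t + 1] = A @ x[t] + rng.multivariate_normal(np.zeros(k), Q)

print(x.shape, y.shape)  # (50, 2) (50, 4)
```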

Degeneracy in the Basic Model
- Structure in Q can be moved into A and C, so w.l.o.g. Q = I
- R cannot be restricted, since the y_t are observed
- Components of x can be reordered arbitrarily; the ordering is fixed by the norms of the columns of C
- x_1 ~ N(µ_1, Q_1)
- A and C are assumed to have rank k; Q, R, Q_1 are assumed to be full rank

Probability Computation
- P(x_{t+1} | x_t) = N(A x_t, Q; x_{t+1})
- P(y_t | x_t) = N(C x_t, R; y_t)
- P({x_1, ..., x_T}, {y_1, ..., y_T}) = P(x_1) ∏_{t=1}^{T-1} P(x_{t+1} | x_t) ∏_{t=1}^{T} P(y_t | x_t)
- The negative log probability is a sum of quadratic (Mahalanobis) terms, one per factor

Inference
- Given model parameters {A, C, Q, R, µ_1, Q_1} and observations y, what can be inferred about the hidden states x?
- Total likelihood
- Filtering: P(x_t | y_1, ..., y_t)
- Smoothing: P(x_t | y_1, ..., y_T)
- Partial smoothing: P(x_t | y_1, ..., y_{t+t'})
- Partial prediction: P(x_t | y_1, ..., y_{t-t'})
- These are intermediate values of the recursive methods for computing the total likelihood.

Learning
- Unknown parameters {A, C, Q, R, µ_1, Q_1}; given observations y
- Maximize the log-likelihood L(θ)
- F(Q, θ): the free energy, a lower bound on the log-likelihood

EM Algorithm
- Alternate between maximizing F(Q, θ) w.r.t. Q (E-step) and w.r.t. θ (M-step)
- F = L at the beginning of each M-step
- The E-step does not change θ
- Therefore the likelihood never decreases

Continuous-State LGM
- Static data modeling (no temporal dependence): factor analysis, SPCA, PCA
- Time-series modeling (time ordering of the data is crucial): LDS (Kalman filter models)

Static Data Modelling
- A = 0, so x = w and y = C x + v
- x_1 ~ N(0, Q), hence y ~ N(0, C Q C' + R)
- Degeneracy in the model
- Learning: EM, with R restricted
- Inference
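The marginal y ~ N(0, C Q C' + R) can be checked empirically; the particular C and R below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
k, p, n = 2, 3, 200_000
C = rng.standard_normal((p, k))
Q = np.eye(k)                     # w.l.o.g. Q = I (see the degeneracy slide)
R = np.diag([0.2, 0.5, 0.3])      # illustrative diagonal observation noise

# Static model: A = 0, so x = w and y = C x + v
x = rng.multivariate_normal(np.zeros(k), Q, size=n)
y = x @ C.T + rng.multivariate_normal(np.zeros(p), R, size=n)

# The sample covariance of y approaches C Q C' + R
sample_cov = np.cov(y, rowvar=False)
match = np.allclose(sample_cov, C @ Q @ C.T + R, atol=0.1)
print(match)  # True
```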

Factor Analysis
- Restrict R to be diagonal; Q = I
- x: factors; C: factor loading matrix; R: uniquenesses
- Learning: EM, or quasi-Newton optimization
- Inference

SPCA
- R = εI, where ε is a global noise level
- The columns of C span the principal subspace
- Learning: EM algorithm
- Inference

PCA
- R = lim_{ε→0} εI
- Learning: diagonalize the sample covariance of the data; the leading k eigenvalues and eigenvectors define C
- EM finds the leading eigenvectors without explicit diagonalization
- Inference: as the noise becomes infinitesimal, the posterior collapses to a single point
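The diagonalization route can be sketched on synthetic data; the check below verifies that projecting onto the top-k eigenvectors decorrelates the data, with variances equal to the leading eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic correlated data (a random linear mix of 5 Gaussian sources)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))
Xc = X - X.mean(axis=0)

# Diagonalize the sample covariance; the leading k eigenvectors define C
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
k = 2
C = eigvecs[:, ::-1][:, :k]              # top-k principal directions
Z = Xc @ C                               # zero-noise posterior: projection onto the subspace

# cov(Z) is diagonal, with the top-k eigenvalues on the diagonal
decorrelated = np.allclose(np.cov(Z, rowvar=False),
                           np.diag(eigvals[::-1][:k]), atol=1e-8)
print(decorrelated)  # True
```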

Linear Dynamical Systems
- Inference (filtering): Kalman filter
- Smoothing: RTS (Rauch-Tung-Striebel) recursions
- Learning: EM algorithm
  - C known: Shumway and Stoffer, 1982
  - All parameters unknown: Ghahramani and Hinton, 1995
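A minimal sketch of the Kalman filtering recursions (predict with the dynamics, then update with the observation); the tiny 1-D parameters are assumptions for the demo:

```python
import numpy as np

def kalman_filter(ys, A, C, Q, R, mu1, Q1):
    """Forward (filtering) recursions: P(x_t | y_1..y_t) = N(mu_t, V_t)."""
    mu, V = mu1, Q1
    filtered = []
    for t, y in enumerate(ys):
        if t > 0:                        # predict: propagate through the dynamics
            mu, V = A @ mu, A @ V @ A.T + Q
        S = C @ V @ C.T + R              # innovation covariance
        K = V @ C.T @ np.linalg.inv(S)   # Kalman gain
        mu = mu + K @ (y - C @ mu)       # update with observation y_t
        V = V - K @ C @ V
        filtered.append((mu, V))
    return filtered

# Tiny 1-D demo with assumed parameters
A = np.array([[1.0]]); C = np.array([[1.0]])
Q = np.array([[0.1]]); R = np.array([[1.0]])
out = kalman_filter([np.array([1.0]), np.array([1.2])],
                    A, C, Q, R, mu1=np.array([0.0]), Q1=np.array([[1.0]]))
print(np.round(out[1][0], 4))  # filtered state mean after both observations: [0.7625]
```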

Discrete-State LGM
- x_{t+1} = WTA[A x_t + w]
- y_t = C x_t + v
- x_1 = WTA[N(µ_1, Q_1)]
- WTA[·] is the winner-take-all nonlinearity, which maps its argument to the unit vector e_j of its largest component

Discrete-State LGM
- Static data modeling: Mixture of Gaussians, VQ
- Time-series modeling: HMM

Static Data Modelling
- A = 0, so x = WTA[w] with w ~ N(µ, Q), and y = C x + v
- π_j = P(x = e_j); a nonzero µ gives nonuniform π_j
- Given x = e_j, y ~ N(C_j, R), where C_j is the j-th column of C

Mixture of Gaussians
- Mixing coefficient of cluster j: π_j
- Means: the columns C_j; variance: R
- Learning: EM (corresponds to maximum-likelihood competitive learning)
- Inference
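The EM updates can be sketched for a 1-D, two-component mixture on synthetic data (cluster locations, initialization, and iteration count below are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(3)
# Two well-separated 1-D clusters (synthetic data)
data = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)])

pi = np.array([0.5, 0.5])    # mixing coefficients pi_j
mu = np.array([-1.0, 1.0])   # component means (the columns C_j in the LGM view)
var = np.array([1.0, 1.0])   # per-component variances (playing the role of R)

for _ in range(50):
    # E-step: responsibilities = posterior over the discrete state
    dens = pi * np.exp(-0.5 * (data[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted maximum-likelihood updates
    Nj = resp.sum(axis=0)
    pi = Nj / len(data)
    mu = (resp * data[:, None]).sum(axis=0) / Nj
    var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / Nj

print(np.round(np.sort(mu), 1))  # component means recovered near -5 and 5
```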

Vector Quantization
- The observation noise becomes infinitesimal
- Inference reduces to the 1-NN rule: Euclidean distance when R shrinks as a scaled identity, Mahalanobis distance for general R
- The posterior collapses onto the closest cluster
- Learning with EM = a batch version of k-means
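In this zero-noise limit, the E-step becomes a hard 1-NN assignment and the M-step becomes a cluster mean, i.e. batch k-means; a minimal 1-D sketch under assumed synthetic data and initialization:

```python
import numpy as np

rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(-5, 1, 100), rng.normal(5, 1, 100)])
centers = np.array([-1.0, 1.0])   # assumed initial cluster centers

for _ in range(10):
    # E-step limit: hard 1-NN assignment (posterior collapses to the closest cluster)
    assign = np.argmin(np.abs(data[:, None] - centers), axis=1)
    # M-step limit: each center becomes the mean of its assigned points
    centers = np.array([data[assign == j].mean() for j in (0, 1)])

print(np.round(np.sort(centers), 1))  # centers recovered near -5 and 5
```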

Time-series modelling

HMM
- Transition matrix T, with T_{i,j} = P(x_{t+1} = e_j | x_t = e_i)
- For every T there exist A and Q that realize it
- Filtering: forward recursions
- Smoothing: forward-backward algorithm
- Learning: EM (Baum-Welch re-estimation)
- MAP state sequences: Viterbi algorithm
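The forward recursions can be sketched on a toy 2-state HMM (the transition, emission, and initial probabilities below are assumptions for the demo):

```python
import numpy as np

# Toy 2-state, 2-symbol HMM with assumed parameters
T_mat = np.array([[0.9, 0.1],     # T[i, j] = P(x_{t+1} = e_j | x_t = e_i)
                  [0.2, 0.8]])
emit = np.array([[0.7, 0.3],      # emit[i, o] = P(y = o | x = e_i)
                 [0.1, 0.9]])
init = np.array([0.5, 0.5])       # P(x_1)

def forward(obs):
    """Forward recursion: alpha_t(i) = P(y_1..y_t, x_t = e_i)."""
    alpha = init * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ T_mat) * emit[:, o]   # sum over previous state, then emit
    return alpha

obs = [0, 0, 1]
alpha = forward(obs)
print(alpha.sum())  # total likelihood P(y_1..y_T) of the observation sequence
```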