Machine Learning & Data Mining (CS/CNS/EE 155), Lecture 6: Conditional Random Fields


1 Machine Learning & Data Mining, CS/CNS/EE 155, Lecture 6: Conditional Random Fields

2 Previous Lecture
Sequence Prediction
 – Input: x = (x_1, …, x_M)
 – Predict: y = (y_1, …, y_M)
 – Naïve full multiclass: exponential explosion
 – Independent multiclass: strong independence assumption
Hidden Markov Models
 – Generative model: P(y_i | y_{i-1}), P(x_i | y_i)
 – Prediction using Bayes's Rule + Viterbi
 – Train using Maximum Likelihood

3 Outline of Today
Long Prelude:
 – Generative vs Discriminative Models
 – Naïve Bayes
Conditional Random Fields
 – Discriminative version of HMMs

4 Generative vs Discriminative
Generative Models (e.g., Hidden Markov Models):
 – Joint Distribution: P(x,y)
 – Uses Bayes's Rule to predict: argmax_y P(y|x)  (mismatch: trained on P(x,y), but predicts with P(y|x))
 – Can generate new samples (x,y)
Discriminative Models (e.g., Conditional Random Fields):
 – Conditional Distribution: P(y|x)
 – Can use the model directly to predict: argmax_y P(y|x)  (same thing the model is trained on)
Both trained via Maximum Likelihood

5 Naïve Bayes
Binary (or multiclass) prediction.
 – Model the joint distribution (generative): P(x,y)
 – "Naïve" independence assumption: the features x_d are conditionally independent given y
 – Prediction via: argmax_y P(y|x)
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
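The model equations on this slide are images in the original transcript; the following is a standard reconstruction of the Naïve Bayes factorization and prediction rule, consistent with the worked example on the next slide:

```latex
% Joint model with D conditionally independent features (the "naive" assumption)
P(x, y) = P(y) \prod_{d=1}^{D} P(x_d \mid y)

% Prediction via Bayes's rule; P(x) does not depend on y, so it can be dropped
\arg\max_y P(y \mid x) = \arg\max_y \frac{P(x, y)}{P(x)} = \arg\max_y \, P(y) \prod_{d=1}^{D} P(x_d \mid y)
```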

6 Naïve Bayes
Prediction example:

  P(x_d=1|y)     y=-1   y=+1
  P(x_1=1|y)     0.5    0.7
  P(x_2=1|y)     0.9    0.4
  P(x_3=1|y)     0.1    0.5

  P(y):  P(y=-1) = 0.4,  P(y=+1) = 0.6

  x         P(y=-1|x) ∝                      P(y=+1|x) ∝                      Predict
  (1,0,0)   0.4 * 0.5 * 0.1 * 0.9 = 0.018    0.6 * 0.7 * 0.6 * 0.5 = 0.126    y = +1
  (0,1,1)   0.4 * 0.5 * 0.9 * 0.1 = 0.018    0.6 * 0.3 * 0.4 * 0.5 = 0.036    y = +1
  (0,1,0)   0.4 * 0.5 * 0.9 * 0.9 = 0.162    0.6 * 0.3 * 0.4 * 0.5 = 0.036    y = -1
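A minimal sketch (not from the lecture; variable names are hypothetical) that reproduces the table above by scoring each class with the unnormalized joint P(y) * prod_d P(x_d|y):

```python
# Parameters from the slide's table.
p_y = {-1: 0.4, +1: 0.6}
p_xd_given_y = {          # P(x_d = 1 | y) for d = 1, 2, 3
    -1: [0.5, 0.9, 0.1],
    +1: [0.7, 0.4, 0.5],
}

def joint_score(x, y):
    """P(y) * prod_d P(x_d | y), using 1 - p when x_d = 0."""
    score = p_y[y]
    for x_d, p in zip(x, p_xd_given_y[y]):
        score *= p if x_d == 1 else (1.0 - p)
    return score

for x in [(1, 0, 0), (0, 1, 1), (0, 1, 0)]:
    scores = {y: joint_score(x, y) for y in (-1, +1)}
    print(x, scores, "predict y =", max(scores, key=scores.get))
# (1, 0, 0): {-1: 0.018, +1: 0.126}  -> predict +1
# (0, 1, 1): {-1: 0.018, +1: 0.036}  -> predict +1
# (0, 1, 0): {-1: 0.162, +1: 0.036}  -> predict -1
```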

7 Naïve Bayes
Matrix formulation (same numbers as the previous slide):

  P(x_d=1|y)     y=-1   y=+1
  P(x_1=1|y)     0.5    0.7
  P(x_2=1|y)     0.9    0.4
  P(x_3=1|y)     0.1    0.5
  (each conditional P(x_d|y) sums to 1 over x_d ∈ {0,1})

  P(y):  P(y=-1) = 0.4,  P(y=+1) = 0.6   (sums to 1)

8 Naïve Bayes
Train via Maximum Likelihood: estimate P(y) and each P(x_d|y) from data
 – Count frequencies
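A minimal sketch (variable names and the tiny dataset are assumptions, not from the lecture) of maximum likelihood training for Naïve Bayes with binary features; the estimates are just frequency counts:

```python
import numpy as np

def train_naive_bayes(X, y):
    """X: (N, D) array of 0/1 features; y: (N,) array of class labels."""
    classes = np.unique(y)
    p_y = {c: float(np.mean(y == c)) for c in classes}            # P(y=c)
    p_x1_given_y = {c: X[y == c].mean(axis=0) for c in classes}   # P(x_d=1 | y=c)
    return p_y, p_x1_given_y

# Tiny made-up dataset: 5 points, D = 3 binary features.
X = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]])
y = np.array([+1, -1, +1, -1, +1])
p_y, p_x1_given_y = train_naive_bayes(X, y)
print(p_y)            # P(y=-1) = 0.4, P(y=+1) = 0.6
print(p_x1_given_y)   # per-class frequency of each feature being 1
```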

9 Naïve Bayes vs HMMs
Naïve Bayes: "naïve" generative independence assumption, with a single class prior P(y)
Hidden Markov Models: the same structure, except P(y) becomes a 1st-order chain over (y_1, …, y_M)
HMMs ≈ 1st-order variant of Naïve Bayes! (just one interpretation…)

10 Naïve Bayes vs HMMs
Naïve Bayes: each feature generated independently given the class, P(x_d|y)
Hidden Markov Models: each observation generated independently given its state, P(x_i|y_i)
HMMs ≈ 1st-order variant of Naïve Bayes! (just one interpretation…)
The "naïve" generative independence assumption appears in both the P(y) and P(x|y) factors.

11 Summary: Naïve Bayes
"Generative Model" (can sample new data)
Joint model of (x,y):
 – "Naïve" independence assumption on each x_d
Use Bayes's Rule for prediction: argmax_y P(y|x)
Maximum Likelihood Training:
 – Count frequencies

12 Learn Conditional Prob.?
Weird to train to maximize P(x,y) when the goal should be to maximize P(y|x).
 – Maximizing P(y|x) directly breaks the independence structure: we can no longer train with count statistics.
 – *HMMs suffer the same problem.
In general, you should maximize the likelihood of the model you define! So if you define a joint model P(x,y), then maximize P(x,y) on the training data.

13 Summary: Generative Models
Joint model of (x,y), e.g., Naïve Bayes & HMMs:
 – Compact & easy to train…
 – …with independence assumptions
Maximum Likelihood Training: maximize P(x,y | θ) on the training data (θ often used to denote all parameters of the model)
Mismatch with prediction goal:
 – We predict with P(y|x), but P(y|x) is hard to maximize directly in a joint model

14 Discriminative Models
Conditional model P(y|x):
 – Directly models the prediction goal
Maximum Likelihood: maximize P(y|x) on the training data
 – Matches the prediction goal argmax_y P(y|x)
What does P(y|x) look like?

15 First Try
Model P(y|x) separately for every possible x, and train by counting frequencies:

  P(y=1|x)   x_1   x_2
  0.5        0     0
  0.7        0     1
  0.2        1     0
  0.4        1     1

Exponential in the number of input variables D!
 – Need to assume something… what?

16 Log-Linear Models! (Logistic Regression)
"Log-Linear" assumption: P(y|x) ∝ exp(w_y · x + b_y)
 – Reduces the model representation to linear in D
 – Most common discriminative probabilistic model
Prediction: argmax_y P(y|x)
Training: maximize P(y|x) on the training data, so training and prediction match!
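A minimal sketch (assumed shapes and names, not the lecture's code) of multiclass logistic regression under this log-linear assumption, trained by gradient descent on the negative log conditional likelihood:

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def train_logistic_regression(X, y, K, lr=0.1, iters=500):
    """X: (N, D) features; y: (N,) labels in {0, ..., K-1}."""
    N, D = X.shape
    W = np.zeros((K, D))
    b = np.zeros(K)
    Y = np.eye(K)[y]                           # one-hot labels, (N, K)
    for _ in range(iters):
        P = softmax(X @ W.T + b)               # P(y|x) for each training point
        grad_scores = (P - Y) / N              # gradient of the log loss wrt scores
        W -= lr * grad_scores.T @ X
        b -= lr * grad_scores.sum(axis=0)
    return W, b

# Tiny made-up dataset with D = 3 features and K = 2 classes.
X = np.array([[1., 0., 0.], [0., 1., 1.], [0., 1., 0.], [1., 1., 0.]])
y = np.array([1, 1, 0, 0])
W, b = train_logistic_regression(X, y, K=2)
print(softmax(X @ W.T + b))   # rows are the fitted P(y|x) for each training point
```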

17 Naïve Bayes vs Logistic Regression
Naïve Bayes (models P(y) and P(x|y)):
 – Strong independence assumptions
 – Super easy to train…
 – …but mismatch with prediction
Logistic Regression (models P(y|x)):
 – "Log-Linear" assumption, often more flexible than Naïve Bayes
 – Harder to train (gradient descent)…
 – …but matches prediction

18 Naïve Bayes vs Logistic Regression
NB has K parameters for P(y) (i.e., A); LR has K parameters for the bias b.
NB has K*D parameters for P(x|y) (i.e., O); LR has K*D parameters for w.
Same number of parameters!
Intuition: both models have the same "capacity".
 – NB spends a lot of capacity on P(x); LR spends all of its capacity on P(y|x).
No model is perfect! (Especially on a finite training set)
 – NB will trade off P(y|x) with P(x); LR will fit P(y|x) as well as possible.

19 Generative vs Discriminative
 – Model: Generative: P(x,y), joint model over x and y, cares about everything. Discriminative: P(y|x) (when probabilistic), conditional model, only cares about predicting well.
 – Examples: Generative: Naïve Bayes, HMMs; also Topic Models (later). Discriminative: Logistic Regression, CRFs; also SVM, Least Squares, etc.
 – Training: Generative: Max Likelihood. Discriminative: Max (Conditional) Likelihood (= minimize log loss); can pick any loss based on y (hinge loss, squared loss, etc.).
 – Probabilistic? Generative: always. Discriminative: not necessarily, and certainly never joint over P(x,y).
 – Assumptions: Generative: often strong assumptions (keeps training tractable). Discriminative: more flexible assumptions (focuses the entire model on P(y|x)).
 – Train vs predict: Generative: mismatch between train & predict, requires Bayes's rule. Discriminative: train to optimize the prediction goal.
 – Sampling: Generative: can sample anything. Discriminative: can only sample y given x.
 – Missing values: Generative: can handle missing values in x. Discriminative: cannot.

20 Conditional Random Fields

21 "Log-Linear" 1st Order Sequential Model
 – Scoring function F(y,x): a sum of terms scoring transitions (y_{j-1} → y_j) and terms scoring input features (x_j under label y_j)
 – y_0 = special start state
 – Normalizer Z(x), aka the "Partition Function"
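The model equation on this slide is an image in the original; a standard reconstruction of the 1st-order log-linear sequence model, using the transition scores u and input-feature scores w that appear in the worked example on the next slide:

```latex
F(y, x) = \sum_{j=1}^{M} \Big( u_{y_j, y_{j-1}} + w_{y_j, x_j} \Big),
\qquad y_0 = \text{Start}

P(y \mid x) = \frac{\exp\big(F(y,x)\big)}{Z(x)},
\qquad Z(x) = \sum_{y'} \exp\big(F(y', x)\big)
```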

22 Worked example: x = "Fish Sleep", y = (N,V)

  Transition scores u_{a,b} (score of moving from label b to label a):
                 a=N    a=V
    b=N          -2      1
    b=V           2     -2
    b=Start       1     -1

  Input-feature scores w_{a,x}:
                 a=N    a=V
    x=Fish        2      1
    x=Sleep       1      0

  y        exp(F(y,x))
  (N,N)    exp(1 + 2 - 2 + 1) = exp(2)
  (N,V)    exp(1 + 2 + 1 + 0) = exp(4)
  (V,N)    exp(-1 + 1 + 2 + 1) = exp(3)
  (V,V)    exp(-1 + 1 - 2 + 0) = exp(-2)

  Z(x) = exp(2) + exp(4) + exp(3) + exp(-2)
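A minimal sketch (brute-force enumeration; the variable names are hypothetical, not from the lecture) that reproduces these scores and the partition function:

```python
import itertools
import numpy as np

# Transition scores u[(a, b)]: score of moving from label b to label a.
u = {('N', 'N'): -2, ('V', 'N'): 1, ('N', 'V'): 2, ('V', 'V'): -2,
     ('N', 'Start'): 1, ('V', 'Start'): -1}
# Input-feature scores w[(a, word)].
w = {('N', 'Fish'): 2, ('V', 'Fish'): 1, ('N', 'Sleep'): 1, ('V', 'Sleep'): 0}

x = ['Fish', 'Sleep']
labels = ['N', 'V']

def F(y, x):
    """Sum of transition and input-feature scores along the sequence."""
    total, prev = 0.0, 'Start'
    for y_j, x_j in zip(y, x):
        total += u[(y_j, prev)] + w[(y_j, x_j)]
        prev = y_j
    return total

scores = {y: F(y, x) for y in itertools.product(labels, repeat=len(x))}
Z = sum(np.exp(s) for s in scores.values())
print(scores)   # {('N','N'): 2, ('N','V'): 4, ('V','N'): 3, ('V','V'): -2}
print(Z)        # exp(2) + exp(4) + exp(3) + exp(-2), roughly 82.2
print(np.exp(scores[('N', 'V')]) / Z)   # P(y = (N,V) | x)
```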

23 Example (continued): x = "Fish Sleep", y = (N,V)  (*holding the other parameters fixed)

24 Basic Conditional Random Field
Directly models P(y|x):
 – Discriminative
 – Log-linear assumption
 – Same # of parameters as an HMM
 – 1st-order sequential logistic regression
 – CRF spends all model capacity on P(y|x), rather than P(x,y)
How to predict? How to train? Extensions?

25 Predict via Viterbi
F(y,x) decomposes into transition scores and input-feature scores, so argmax_y F(y,x) can be computed with dynamic programming:
 – Maintain the best length-k prefix solutions (one per ending label)
 – Recursively solve for the length-(k+1) solutions
 – Predict via the best length-M solution
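A minimal Viterbi sketch for this scoring function (self-contained; it redefines the hypothetical u and w from the "Fish Sleep" example). It maximizes F(y,x) directly, so the partition function is not needed for prediction:

```python
# Hypothetical parameters from the "Fish Sleep" example, not code from the lecture.
u = {('N', 'N'): -2, ('V', 'N'): 1, ('N', 'V'): 2, ('V', 'V'): -2,
     ('N', 'Start'): 1, ('V', 'Start'): -1}
w = {('N', 'Fish'): 2, ('V', 'Fish'): 1, ('N', 'Sleep'): 1, ('V', 'Sleep'): 0}

def viterbi(x, labels, u, w):
    """Return argmax_y F(y,x) and its score for the 1st-order log-linear model."""
    # best[a] = (score, sequence) of the best length-k prefix ending in label a.
    best = {a: (u[(a, 'Start')] + w[(a, x[0])], [a]) for a in labels}
    for x_j in x[1:]:
        best = {
            a: max((best[b][0] + u[(a, b)] + w[(a, x_j)], best[b][1] + [a])
                   for b in labels)
            for a in labels
        }
    score, y_hat = max(best.values())
    return y_hat, score

print(viterbi(['Fish', 'Sleep'], ['N', 'V'], u, w))   # (['N', 'V'], 4)
```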

26 (Viterbi trellis, step 1) Solve for Ŷ_1(T) for each label T ∈ {V, D, N}; store each Ŷ_1(T) & F(Ŷ_1(T), x). Ŷ_1(T) is just T.

27 (Viterbi trellis, step 2) Solve for Ŷ_2(T) by extending the stored length-1 solutions Ŷ_1(T) & F(Ŷ_1(T), x). Ex: Ŷ_2(V) = (N, V), i.e., the best length-2 sequence ending in V has y_1 = N.

28 (Viterbi trellis, step 3) Store each Ŷ_2(T) & F(Ŷ_2(T), x), then solve for Ŷ_3(T) by considering each possible y_2 ∈ {V, D, N}. Ex: Ŷ_2(V) = (N, V).

29 (Viterbi trellis, continued) Store each Ŷ_3(T) & F(Ŷ_3(T), x) (Ex: Ŷ_3(V) = (D, N, V)), and continue until the length-M solutions Ŷ_M(T) are reached.

30 Computing P(y|x)
Viterbi doesn't compute P(y|x):
 – It just maximizes the numerator F(y,x)
To get P(y|x) we also need to compute Z(x):
 – aka the "Partition Function"

31 Computing Partition Function
Naive approach: iterate over all y'
 – Exponential time: L^M possible y' (L labels, length-M sequence)!
Notation: define G_j(a,b) = exp(u_{a,b} + w_{a,x_j}), the weight of labeling position j with a when position j-1 is labeled b
http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf

32 Matrix Semiring
Matrix version of G_j: an (L+1) x (L+1) matrix with entries G_j(a,b), where the L+1 states include 'Start'.
Define products of these matrices:
 – G_{1:2} = G_2 G_1
 – G_{i:j} = G_j G_{j-1} … G_{i+1} G_i
http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf

33 Path Counting Interpretation
G_1(a,b):
 – L+1 possible start & end locations
 – Weight of the path from 'b' to 'a' in step 1
G_{1:2}(a,b), where G_{1:2} = G_2 G_1:
 – Weight of all paths that start in 'b' at the beginning of step 1 and end in 'a' after step 2
http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf

34 Computing Partition Function
Consider length-1 (M=1): Z(x) = sum of column 'Start' of G_1.
M=2: Z(x) = sum of column 'Start' of G_{1:2}.
General M:
 – Do M multiplications of (L+1)x(L+1) matrices to compute G_{1:M} = G_M G_{M-1} … G_2 G_1
 – Z(x) = sum of column 'Start' of G_{1:M}
http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
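A minimal sketch of this matrix computation of Z(x) (reusing the hypothetical u, w from the "Fish Sleep" example); with L = 2 labels it reproduces the brute-force value exp(2) + exp(4) + exp(3) + exp(-2):

```python
import numpy as np

u = {('N', 'N'): -2, ('V', 'N'): 1, ('N', 'V'): 2, ('V', 'V'): -2,
     ('N', 'Start'): 1, ('V', 'Start'): -1}
w = {('N', 'Fish'): 2, ('V', 'Fish'): 1, ('N', 'Sleep'): 1, ('V', 'Sleep'): 0}

def partition_function(x, labels, u, w):
    """Z(x) as the 'Start' column sum of G_1:M = G_M ... G_1."""
    states = labels + ['Start']                 # L+1 states
    n = len(states)
    G_total = np.eye(n)
    for x_j in x:
        G = np.zeros((n, n))                    # G_j(a,b) = exp(u[a,b] + w[a,x_j])
        for ai, a in enumerate(labels):         # the 'Start' row stays all zeros
            for bi, b in enumerate(states):
                G[ai, bi] = np.exp(u[(a, b)] + w[(a, x_j)])
        G_total = G @ G_total                   # G_1:j = G_j G_1:j-1
    return G_total[:, states.index('Start')].sum()

print(partition_function(['Fish', 'Sleep'], ['N', 'V'], u, w))
# roughly 82.2, matching the brute-force enumeration above
```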

35 Train via Gradient Descent
Similar to Logistic Regression:
 – Gradient descent on the negative log (conditional) likelihood (log loss): -log P(y|x) = log Z(x) - F(y,x)
 – The first term, F(y,x), is easy to differentiate: it is linear in the parameters (θ often used to denote all parameters of the model)
 – The log partition function log Z(x) is harder to differentiate!
http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf

36 Differentiating Log Partition
 – Lots of chain rule & algebra!
 – The intermediate terms reduce to the definition of P(y'|x), marginalized over all y'
 – The resulting expectations are computed with Forward-Backward!
http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf
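The derivation itself is an image in the original; a standard reconstruction of the result for a log-linear model with F(y,x) = θᵀΦ(y,x), where Φ(y,x) is the sum of the per-step features:

```latex
\frac{\partial}{\partial \theta} \log Z(x)
  = \frac{1}{Z(x)} \sum_{y'} \exp\big(F(y',x)\big)\,\frac{\partial F(y',x)}{\partial \theta}
  = \sum_{y'} P(y' \mid x)\, \Phi(y', x)
  = \mathbb{E}_{y' \sim P(y'|x)}\big[\Phi(y', x)\big]
```

This expectation decomposes into per-position marginals over (y_{j-1}, y_j), which is exactly what Forward-Backward computes.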

37 Optimality Condition
Consider one parameter of the model. At the optimum:
 – Frequency counts = conditional expectation on the training data!
 – Holds for each component of the model
 – Each component is a "log-linear" model and requires gradient descent
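The condition on this slide is an image; a standard reconstruction for a single weight θ_k with feature Φ_k, obtained by setting the gradient of the training objective to zero over the N training examples:

```latex
\sum_{i=1}^{N} \Phi_k\big(y^{(i)}, x^{(i)}\big)
  = \sum_{i=1}^{N} \mathbb{E}_{y' \sim P(y' \mid x^{(i)})}\big[\Phi_k(y', x^{(i)})\big]
```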

38 Forward-Backward for CRFs
http://www.inference.phy.cam.ac.uk/hmw26/papers/crf_intro.pdf

39 Path Interpretation
α_3(V) = total weight of paths from "Start" to "V" in the 3rd step, summed over all intermediate labels.
 – For example, one such path contributes G_1(N,"Start") x G_2(D,N) x G_3(V,D); another begins with G_1(V,"Start"); and so on.
 – β just does it backwards.

40 Matrix Formulation
 – Forward: α_2 = G_2 α_1 (in general, α_j = G_j α_{j-1})
 – Backward: β_5 = (G_6)^T β_6 (in general, β_j = (G_{j+1})^T β_{j+1})
Use matrices! Fast to compute! Easy to implement!
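A minimal forward-backward sketch with these matrix recursions (reusing the hypothetical "Fish Sleep" parameters, and assuming the G_j(a,b) = exp(u_{a,b} + w_{a,x_j}) construction above); it recovers Z(x) and the per-position marginals P(y_j = a | x):

```python
import numpy as np

u = {('N', 'N'): -2, ('V', 'N'): 1, ('N', 'V'): 2, ('V', 'V'): -2,
     ('N', 'Start'): 1, ('V', 'Start'): -1}
w = {('N', 'Fish'): 2, ('V', 'Fish'): 1, ('N', 'Sleep'): 1, ('V', 'Sleep'): 0}

def forward_backward(x, labels, u, w):
    """Return alphas, betas, Z(x), and per-position marginals P(y_j = a | x)."""
    L, M = len(labels), len(x)
    # G[j][a, b] over the L labels only; 'Start' is handled in alpha[0].
    G = [np.array([[np.exp(u[(a, b)] + w[(a, x_j)]) for b in labels]
                   for a in labels]) for x_j in x]
    alpha = np.zeros((M, L))
    alpha[0] = [np.exp(u[(a, 'Start')] + w[(a, x[0])]) for a in labels]
    for j in range(1, M):
        alpha[j] = G[j] @ alpha[j - 1]          # alpha_j = G_j alpha_{j-1}
    beta = np.ones((M, L))
    for j in range(M - 2, -1, -1):
        beta[j] = G[j + 1].T @ beta[j + 1]      # beta_j = (G_{j+1})^T beta_{j+1}
    Z = alpha[-1].sum()
    return alpha, beta, Z, alpha * beta / Z     # marginals: each row sums to 1

alpha, beta, Z, marginals = forward_backward(['Fish', 'Sleep'], ['N', 'V'], u, w)
print(Z)          # roughly 82.2, same as the matrix partition-function computation
print(marginals)  # e.g. P(y_1 = N | x) ~ 0.75, P(y_2 = V | x) ~ 0.67
```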

41 Path Interpretation: Forward-Backward vs Viterbi
Forward (and Backward) sums over all paths:
 – Computes the expectation of reaching each state
 – E.g., the total (un-normalized) probability of y_3 = Verb over all possible y_{1:2}
Viterbi only keeps the best path:
 – Computes the best possible path reaching each state
 – E.g., the single highest-probability setting of y_{1:3} such that y_3 = Verb

42 Summary: Training CRFs
Similar optimality condition as HMMs:
 – Match frequency counts of model components!
 – Except HMMs can just set the model parameters using counts
 – CRFs need to do gradient descent to match counts
Run Forward-Backward to compute the expectations
 – Just like HMMs as well

43 More General CRFs
 – Old: separate transition scores u and input-feature scores w
 – New: score each step with a single weight vector, θ^T φ_j(a,b|x)
 – Reduction: θ is the "flattened" weight vector (u and w stacked together)
 – Can extend φ_j(a,b|x) with richer features

44 More General CRFs
1st-order Sequence CRFs in the general notation:
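The formula on this slide is an image; a standard reconstruction of the general 1st-order sequence CRF in the notation above:

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{j=1}^{M} \theta^{\top} \phi_j(y_j, y_{j-1} \mid x) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{j=1}^{M} \theta^{\top} \phi_j(y'_j, y'_{j-1} \mid x) \Big)
```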

45 Example
The feature vector φ_j(a,b|x):
 – Various attributes of x (around position j)
 – Stacked once for each label value; all 0's except the 1 sub-vector corresponding to the active label
 – The basic formulation only had the first part (transition and word-identity scores)
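A minimal sketch of such a stacked feature vector (the attribute choices are hypothetical, not from the lecture): attributes of x_j are copied into the block for the active label a, plus indicator features for the transition (b → a):

```python
import numpy as np

LABELS = ['N', 'V', 'D']

def attributes(x, j):
    """A few example attributes of x around position j (hypothetical choices)."""
    word = x[j]
    return np.array([1.0,                          # bias
                     float(word[0].isupper()),     # capitalized?
                     float(word.endswith('s'))])   # ends in 's'?

def phi(a, b, x, j):
    """phi_j(a, b | x): transition indicators + label-stacked attributes of x."""
    states = ['Start'] + LABELS
    trans = np.zeros(len(states) * len(LABELS))    # indicator for (b -> a)
    trans[states.index(b) * len(LABELS) + LABELS.index(a)] = 1.0
    attr = attributes(x, j)
    stacked = np.zeros(len(LABELS) * len(attr))    # all 0's except one sub-vector
    ai = LABELS.index(a)
    stacked[ai * len(attr):(ai + 1) * len(attr)] = attr
    return np.concatenate([trans, stacked])

x = ['Fish', 'Sleep']
print(phi('V', 'N', x, 1))   # step j=1 then scores as theta @ phi('V', 'N', x, 1)
```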

46 Summary: CRFs
"Log-Linear" 1st-order sequence model
 – Multiclass LR + 1st-order components
 – Discriminative version of HMMs
 – Predict using Viterbi; train using gradient descent
 – Need Forward-Backward to differentiate the partition function

47 Next Week
Structural SVMs
 – Hinge loss for sequence prediction
More general structured prediction
Next recitation:
 – Optimizing non-differentiable functions (Lasso)
 – Accelerated gradient descent
Homework 2 due in 12 days
 – Tuesday, Feb 3rd at 2pm via Moodle

