Today's Specials ● Detailed look at Lagrange Multipliers ● Forward-Backward and Viterbi algorithms for HMMs ● Intro to EM as a concept [Motivation, Insights]


Lagrange Multipliers ● Why is this used? ● I am in NLP. Why do I care? ● How do I use it? ● Umm, I didn't get it. Show me an example. ● Prove the math. ● Hmm... Interesting!!

Constrained Optimization ● Given a metal wire described by a curve f(x,y) = 0, with a temperature T(x,y) at each point: find the hottest and coldest points on the wire. ● Basically, determine the optima of T subject to the constraint f(x,y) = 0 ● How do you solve this?

Ha... That's Easy!! ● If the constraint is simple enough, solve it for y in terms of x and substitute into T ● Then optimize T over x alone

How about this one? ● Same T ● But now the constraint is a messy curve that can't be solved for y in closed form ● Still want to solve for y and substitute? ● Didn't think so!

All Hail Lagrange! ● Lagrange multipliers [LM] are a tool to solve such problems [& live through it] ● Intuition: – For each constraint 'i', introduce a new scalar variable L_i (the Lagrange multiplier) – Combine f and the constraints into a single function, with these multipliers as coefficients (see the sketch below) – The problem is now unconstrained and can be solved easily
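For a single constraint g(x,y) = 0, here is a minimal sketch of this construction in the notation of the later slides (the formula itself is not spelled out above): form the Lagrangian Λ(x,y,L) = f(x,y) − L * g(x,y). Setting the partial derivatives of Λ with respect to x, y and L to zero recovers grad(f) = L * grad(g) together with g = 0, which is exactly the system used in the recipe below.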

Use for NLP ● Think EM – The “M” step in the EM algorithm stands for “Maximization” – This maximization is also constrained – Substitution does not work here either ● If you are not sure how important EM is, stick around, we'll tell you!

Vector Calculus 101 ● The gradient of a function is a vector: – Direction: the direction of the steepest slope uphill – Magnitude: a measure of the steepness of that slope ● Mathematically, the gradient of f(x,y) is: grad(f(x,y)) = [∂f/∂x, ∂f/∂y]

How do I use LM? ● Follow these steps: – Optimize f, given the constraint g = 0 – Find the gradients of 'f' & 'g', grad(f) & grad(g) – At the constrained optima, grad(f) = L * grad(g) [proof coming] – This gives 3 equations (one each for x, y and z) – Fourth equation: g = 0 – You now have 4 eqns & 4 variables [x, y, z, L] – Feed this system into a numerical solver – This gives the point (x_p, y_p, z_p) where f is maximum. Find f_max – Rejoice!

Examples are for wimps! What is the largest square that can be inscribed in the ellipse? [Figure: an ellipse centered at (0,0) with an inscribed quadrilateral whose corners are (x,y), (-x,y), (-x,-y), (x,-y)] Area of square = 4xy

And all that math... ● Maximize f = 4xy subject to the ellipse constraint g = 0, where grad(g) = [2x, 4y] (so g has the form x² + 2y² − const) ● grad(f) = [4y, 4x] ● Solve: – 2y – Lx = 0 – x – Ly = 0 – g = 0 ● Solution: these three equations give (x_p, y_p) and L, and f_max = 4·x_p·y_p [a numerical sketch follows below]
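To make the "feed this system into a numerical solver" step concrete, here is a minimal Python sketch. The ellipse's constant is not recoverable from the slide, so the code assumes the constraint x² + 2y² = 1, which is at least consistent with grad(g) = [2x, 4y].

```python
# A sketch of the "feed this system into a numerical solver" step.
# Assumption (not on the slide): the ellipse is x^2 + 2*y^2 = 1,
# chosen only because it matches grad(g) = [2x, 4y].
from scipy.optimize import fsolve

def equations(v):
    x, y, L = v
    return [
        2 * y - L * x,         # from 4y = L * 2x
        x - L * y,             # from 4x = L * 4y
        x**2 + 2 * y**2 - 1,   # the constraint g = 0
    ]

x_p, y_p, L = fsolve(equations, [0.5, 0.5, 1.0])  # rough initial guess
print(f"x_p = {x_p:.4f}, y_p = {y_p:.4f}, L = {L:.4f}, f_max = {4 * x_p * y_p:.4f}")
# With the assumed constant, this lands at x_p = 1/sqrt(2), y_p = 1/2, f_max = sqrt(2).
```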

Why does it work? ● Think of an f, say, a paraboloid ● Its “level curves” will be enclosing circles ● The constrained optima lie along g, each on one of these level curves ● The level curve of 'f' and the curve 'g' MUST be tangent at these points: – If not, they cross there, and we could move along g to a lower or higher value of f – So the point would not be an optimum, but it is (contradiction) – Therefore, the two curves are tangent ● Therefore, their gradients (normals) are parallel ● Therefore, grad(f) = L * grad(g)

Expectation Maximization ● We are given data that we assume to be generated by a stochastic process ● We would like to fit a model to this process, i.e., get estimates of model parameters ● These estimates should be such that they maximize the likelihood of the observed data – MLE estimates ● EM does precisely that – and quite efficiently

Obligatory Contrived Example ● Let observed events be grades given out in a class ● Assume that there is a stochastic process generating these grades (yeah... right!) ● P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ – 3µ ● Observations: – Number of A's = 'a' – Number of B's = 'b' – Number of C's = 'c' – Number of D's = 'd' ● What is the ML estimate of 'µ' given a, b, c, d?

Obligatory Contrived Example ● P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ – 3µ ● P(Data | Model) = P(a,b,c,d | µ) = K (½)^a (µ)^b (2µ)^c (½–3µ)^d = Likelihood ● log P(a,b,c,d | µ) = log K + a log ½ + b log µ + c log 2µ + d log(½–3µ) = Log Likelihood [easier to work with, since we have sums instead of products] ● To maximize this, set ∂ log P/∂µ = 0 ● => µ = (b + c) / 6(b + c + d) [derivation sketched below] ● So, if the class got 10 A's, 6 B's, 9 C's and 10 D's, then µ = 1/10 ● This is the regular and boring way to do it ● Let's make things more interesting...
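The algebra behind that result, filled in here because the slide only states the final value (K is the multinomial coefficient and does not depend on µ):

∂/∂µ [ b log µ + c log 2µ + d log(½ – 3µ) ] = b/µ + c/µ – 3d/(½ – 3µ) = 0
=> (b + c)(½ – 3µ) = 3dµ
=> (b + c)/2 = 3µ (b + c + d)
=> µ = (b + c) / 6(b + c + d)

Plugging in a = 10, b = 6, c = 9, d = 10 gives µ = 15/150 = 1/10, matching the number on the slide.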

Obligatory Contrived Example ● P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½ – 3µ ● A part of the information is now hidden: – Number of high grades (A's + B's) = h ● What is an ML estimate of µ now? ● Here is some delicious circular reasoning: – If we knew the value of µ, we could compute the expected values of 'a' and 'b' – If we knew the values of 'a' and 'b', we could compute the ML estimate for µ ● Voila... EM!! EXPECTATION MAXIMIZATION

Obligatory Contrived Example ● Dance the EM dance: – Start with a guess for µ – Iterate between Expectation and Maximization to improve our estimates of µ and b: ● µ(t), b(t) = estimates of µ & b on the t'th iteration ● µ(0) = initial guess ● b(t) = h · µ(t) / (½ + µ(t)) = E[b | µ(t), h] : E-step ● µ(t+1) = (b(t) + c) / 6(b(t) + c + d) : M-step [maximum likelihood estimate of µ given b(t)] ● Continue iterating until convergence [a code sketch follows below] – Good news: It will converge to a maximum. – Bad news: It will converge to a maximum, which may only be a local one.
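A runnable sketch of this dance for the grades example. The observed counts below are hypothetical, chosen to match the earlier fully-observed example (a = 10, b = 6, so h = 16):

```python
# EM for the contrived grades example: P(A)=1/2, P(B)=mu, P(C)=2*mu, P(D)=1/2-3*mu.
# Observed: h = #A's + #B's (a and b individually hidden), plus the counts c and d.
h, c, d = 16, 9, 10           # hypothetical counts (a=10, b=6 would give h=16)

mu = 0.05                     # mu(0): any initial guess in (0, 1/6)
for t in range(100):
    # E-step: expected number of B's among the h high grades,
    # using P(B | high grade) = mu / (1/2 + mu)
    b = h * mu / (0.5 + mu)
    # M-step: ML estimate of mu given the expected count b and observed c, d
    new_mu = (b + c) / (6 * (b + c + d))
    if abs(new_mu - mu) < 1e-10:
        mu = new_mu
        break                 # converged
    mu = new_mu

print(f"estimated mu = {mu:.4f}")   # settles near 0.089 for these counts
```

Note that the converged value is close to, but not exactly, the fully-observed estimate of 1/10: hiding a and b genuinely removes information.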

Where's the intuition? ● Problem: Given some measurement data X, estimate the parameters Ω of the model to be fit to the problem ● Except there are some nuisance “hidden” variables Y which are not observed and which we want to integrate out ● In particular we want to maximize the posterior probability of Ω given data X, marginalizing over Y: Ω* = argmax over Ω of Σ_Y P(Ω, Y | X) ● The E-step can be interpreted as trying to construct a lower bound for this posterior distribution [see below] ● The M-step optimizes this bound, thereby improving the estimates for the unknowns
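One standard way to write that bound, filled in here since the slide leaves it implicit: for any distribution q(Y) over the hidden variables, Jensen's inequality gives

log Σ_Y P(Ω, Y | X) ≥ Σ_Y q(Y) log [ P(Ω, Y | X) / q(Y) ]

The E-step chooses q(Y) = P(Y | Ω(t), X), which makes the bound touch the true objective at the current estimate Ω(t); the M-step then maximizes the right-hand side over Ω.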

So people actually use it? ● Umm... yeah! ● Some fields where EM is prevalent: – Medical Imaging – Speech Recognition – Statistical Modelling – NLP – Astrophysics ● Basically, anywhere you want to do parameter estimation with hidden or missing data

... and in NLP? ● You bet. ● Almost everywhere you use an HMM, you need EM: – Machine Translation – Part-of-speech tagging – Speech Recognition – Smoothing

Where did the math go? We have to do SOMETHING in the next class!!!