
1 Maximum Likelihood-Maximum Entropy Duality: Session 1. Pushpak Bhattacharyya. Scribed by Aditya Joshi. Presented in the NLP-AI talk on 14th January, 2014.

2 A phenomenon/event could be a linguistic process such as POS tagging or sentiment prediction. A model uses data in order to “predict” future observations w.r.t. the phenomenon. (Slide diagram with boxes: Data/Observation, Phenomenon/Event, Model.)

3 Notations. X: x_1, x_2, x_3, ..., x_m (m observations). A: a random variable with n possible outcomes a_1, a_2, a_3, ..., a_n. e.g. one coin throw: a_1 = 0, a_2 = 1; one dice throw: a_1 = 1, a_2 = 2, a_3 = 3, a_4 = 4, a_5 = 5, a_6 = 6.

4 Goal: estimate P(a_i) = P_i for each i. There are two paths, maximum likelihood (ML) and maximum entropy (ME). Are they equivalent?

5 Calculating probability from data. Suppose that in X: x_1, x_2, x_3, ..., x_m (m observations), the outcome a_i occurs f(a_i) = f_i times. e.g. dice: if the outcomes are 1 1 2 3 1 5 3 4 2 1, then f(1) = 4, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1, f(6) = 0, and m = 10. Hence, P_1 = 4/10, P_2 = 2/10, P_3 = 2/10, P_4 = 1/10, P_5 = 1/10, P_6 = 0/10.
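
(Scribe's addition, not on the slide: a minimal Python sketch of this frequency-count estimate, P_i = f_i / m, applied to exactly the dice example above.)

```python
from collections import Counter

# Empirical probabilities P_i = f(a_i) / m for the dice observations above.
observations = [1, 1, 2, 3, 1, 5, 3, 4, 2, 1]   # m = 10
m = len(observations)
counts = Counter(observations)                  # f(a_i) for each observed outcome

for outcome in range(1, 7):
    f_i = counts[outcome]                       # Counter returns 0 for unseen outcomes
    print(f"P({outcome}) = {f_i}/{m} = {f_i / m}")
# P(1) = 4/10, P(2) = 2/10, P(3) = 2/10, P(4) = 1/10, P(5) = 1/10, P(6) = 0/10
```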

6 In general, the task is: estimate θ, the probability vector (P_1, P_2, ..., P_n), from X.

7 MLE: θ* = argmax_θ Pr(X; θ). With the i.i.d. (independent and identically distributed) assumption, θ* = argmax_θ ∏_{i=1}^{m} Pr(x_i; θ), where each factor Pr(x_i; θ) is one of P(a_1), P(a_2), ..., P(a_n). Grouping identical factors, the product becomes P(a_1)^{f_1} P(a_2)^{f_2} ... P(a_n)^{f_n}.
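
(Scribe's addition: a short sketch, reusing the dice data from slide 5, that evaluates the log-likelihood log Pr(X; θ) = Σ_i f_i log P(a_i) for two candidate θ vectors; the empirical frequencies give the larger value, i.e. they are the MLE for this data.)

```python
import math
from collections import Counter

observations = [1, 1, 2, 3, 1, 5, 3, 4, 2, 1]
counts = Counter(observations)                  # f(a_i)

def log_likelihood(theta):
    # log Pr(X; theta) = sum over observed outcomes of f(a_i) * log P(a_i)
    return sum(f * math.log(theta[a]) for a, f in counts.items())

empirical = {a: counts[a] / len(observations) for a in range(1, 7)}
uniform = {a: 1 / 6 for a in range(1, 7)}

print(log_likelihood(empirical))   # ≈ -14.71
print(log_likelihood(uniform))     # ≈ -17.92 (smaller: uniform is not the MLE here)
```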

8 What is known about θ: Σ_{i=1}^{n} P_i = 1, and P_i ≥ 0 for all i. Introducing entropy: H(θ) = -Σ_{i=1}^{n} P_i ln P_i, the entropy of the distribution.
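
(Scribe's addition: the entropy formula as a small Python helper, assuming the natural log used on the slide.)

```python
import math

def entropy(p):
    # H(theta) = -sum_i P_i ln P_i, with the convention 0 * ln 0 = 0
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)

print(entropy([1/6] * 6))    # ln 6 ≈ 1.792, a fair six-sided dice
print(entropy([0.5, 0.5]))   # ln 2 ≈ 0.693, a fair coin
print(entropy([1.0, 0.0]))   # 0.0, a completely deterministic outcome
```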

9 Some intuition, with the dice example. Outcomes = 1, 2, 3, 4, 5, 6, with P(1) + P(2) + P(3) + ... + P(6) = 1. Entropy(dice) = H(θ) = -Σ_{i=1}^{6} P(i) ln P(i). Now, there is a principle known as Laplace's principle of insufficient reason (the principle of indifference).

10 The best estimate for the dice is P(1) = P(2) = ... = P(6) = 1/6. We will now prove it assuming NO KNOWLEDGE about the dice except that it has six outcomes, each with probability ≥ 0, and that Σ P_i = 1.

11 What does “best” mean? “Best” means most consistent with the situation: the P_i values should be such that they maximize the entropy.

12 Optimization formulation. Maximize -Σ_{i=1}^{6} P(i) log P(i), subject to: Σ_{i=1}^{6} P(i) = 1, and P(i) ≥ 0 for i = 1 to 6.

13 Solving the optimization (1/2). Using Lagrange multipliers, the optimization can be written as: Q = -Σ_{i=1}^{6} P(i) log P(i) - λ (Σ_{i=1}^{6} P(i) - 1) - Σ_{i=1}^{6} β_i P(i). For now, let us ignore the last term (the multipliers for the P(i) ≥ 0 constraints); we will come to it later.

14 Solving the optimization (2/2). Differentiating Q w.r.t. P(i), we get ∂Q/∂P(i) = -log P(i) - 1 - λ. Equating to zero: log P(i) + 1 + λ = 0, so log P(i) = -(1 + λ) and P(i) = e^{-(1+λ)}. Since the right-hand side does not depend on i, to maximize entropy every P(i) must be equal.
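
(Scribe's addition: a symbolic check of this differentiation, assuming SymPy is available; only the terms of Q that involve a particular P(i) are kept, and the β term is ignored as on the slide.)

```python
import sympy as sp

P, lam = sp.symbols("P lambda", positive=True)

# Terms of Q involving one particular P(i): -P(i) log P(i) - lambda * P(i)
Q_i = -P * sp.log(P) - lam * P

print(sp.diff(Q_i, P))                           # -log(P) - lambda - 1
print(sp.solve(sp.Eq(sp.diff(Q_i, P), 0), P))    # [exp(-lambda - 1)]
```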

15 This shows that P(1) = P(2) = ... = P(6). But P(1) + P(2) + ... + P(6) = 1. Therefore P(1) = P(2) = ... = P(6) = 1/6.
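
(Scribe's addition: a numerical cross-check of this result, assuming SciPy is available; solving the constrained program from slide 12 numerically recovers the uniform distribution.)

```python
import numpy as np
from scipy.optimize import minimize

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)      # guard against log(0)
    return np.sum(p * np.log(p))    # minimizing this maximizes entropy

result = minimize(
    neg_entropy,
    x0=np.array([0.3, 0.1, 0.2, 0.1, 0.2, 0.1]),                     # arbitrary start
    bounds=[(0.0, 1.0)] * 6,                                         # P(i) >= 0
    constraints=[{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}],  # sum P(i) = 1
    method="SLSQP",
)
print(result.x)                     # ≈ [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
```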

16 Introducing data into the notion of entropy. Now, we introduce data: X: x_1, x_2, x_3, ..., x_m (m observations); A: a_1, a_2, a_3, ..., a_n (n outcomes). e.g. for a coin, in the absence of data, P(H) = P(T) = 1/2 (as shown in the previous proof). However, suppose data X is observed as follows. Obs-1: H H T H T T H H H T (m = 10, n = 2), giving P(H) = 6/10, P(T) = 4/10. Obs-2: T T H T H T H T T T (m = 10, n = 2), giving P(H) = 3/10, P(T) = 7/10. Which of these is a valid estimate?
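
(Scribe's addition: the two coin estimates computed directly from the tosses; the tenth toss of Obs-1 was not legible in my notes and is assumed to be a T, which is what makes it consistent with the stated counts of 6/10 and 4/10.)

```python
obs1 = list("HHTHTTHHHT")   # Obs-1; final T assumed to complete the 10 tosses
obs2 = list("TTHTHTHTTT")   # Obs-2

for name, obs in [("Obs-1", obs1), ("Obs-2", obs2)]:
    m = len(obs)
    print(f"{name}: P(H) = {obs.count('H')}/{m}, P(T) = {obs.count('T')}/{m}")
# Obs-1: P(H) = 6/10, P(T) = 4/10
# Obs-2: P(H) = 3/10, P(T) = 7/10
```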

17 Change in entropy: entropy reduces as data is observed! (Slide diagram: E_max, the entropy of the uniform distribution; E_1, the entropy for Obs-1 with P(H) = 6/10; E_2, the entropy for Obs-2 with P(H) = 3/10.)
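
(Scribe's addition: the three entropies from the diagram, compared numerically for the coin example above.)

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

E_max = entropy([0.5, 0.5])    # uniform coin:        ln 2 ≈ 0.693
E_1 = entropy([0.6, 0.4])      # Obs-1, P(H) = 6/10:  ≈ 0.673
E_2 = entropy([0.3, 0.7])      # Obs-2, P(H) = 3/10:  ≈ 0.611
print(E_max > E_1 > E_2)       # True: entropy drops once data is observed
```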

18 Start of duality. Maximizing entropy in this situation is the same as minimizing the “entropy reduction” distance, i.e. minimizing the “relative entropy”. (Slide diagram: the gap between E_max, the maximum entropy attained by the uniform distribution, and E_data, the entropy once observations are made, is the entropy reduction.)
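
(Scribe's addition: a numerical sketch of the identity behind this step. For an n-outcome distribution p, the relative entropy D(p || uniform) equals ln n - H(p), so maximizing H(p) is exactly minimizing the relative-entropy distance from the uniform distribution.)

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def relative_entropy(p, q):
    # D(p || q) = sum_i p_i ln(p_i / q_i)
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

p = [0.6, 0.4]                          # e.g. the Obs-1 estimate
uniform = [0.5, 0.5]
print(relative_entropy(p, uniform))     # ≈ 0.0201
print(math.log(2) - entropy(p))         # ≈ 0.0201, the same value
```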

19 Concluding remarks. Thus, in this discussion of ML-ME duality, we will show that MLE minimizes the relative-entropy distance from the uniform distribution. Question: entropy is defined over probability vectors, and the distance between such vectors could also be measured by squared (Euclidean) distance. Why relative entropy?

