
1 Maximum Likelihood-Maximum Entropy Duality: Session 1. Pushpak Bhattacharyya. Scribed by Aditya Joshi. Presented in NLP-AI talk on 14th January, 2015.

2 A phenomenon/event could be a linguistic process such as POS tagging or sentiment prediction. A model uses data in order to "predict" future observations w.r.t. the phenomenon. (Slide diagram: Phenomenon/Event → Data/Observation → Model.)

3 Notation. X: x_1, x_2, x_3, ..., x_m (m observations). A: random variable with n possible outcomes a_1, a_2, a_3, ..., a_n. E.g. one coin throw: a_1 = 0, a_2 = 1; one dice throw: a_1 = 1, a_2 = 2, a_3 = 3, a_4 = 4, a_5 = 5, a_6 = 6.

4 Goal: estimate P(a_i) = P_i. There are two paths: ML (maximum likelihood) and ME (maximum entropy). Are they equivalent?

5 Calculating probability from data. Suppose that in X: x_1, x_2, x_3, ..., x_m (m observations), a_i occurs f(a_i) = f_i times. E.g. dice: if the outcomes are 1 1 2 3 1 5 3 4 2 1, then f(1) = 4, f(2) = 2, f(3) = 2, f(4) = 1, f(5) = 1, f(6) = 0 and m = 10. Hence P_1 = 4/10, P_2 = 2/10, P_3 = 2/10, P_4 = 1/10, P_5 = 1/10, P_6 = 0/10.
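As an illustration (not part of the original slides), a small Python sketch of this frequency-based estimate for the dice data above:

    from collections import Counter

    observations = [1, 1, 2, 3, 1, 5, 3, 4, 2, 1]            # the dice data from the slide
    m = len(observations)
    counts = Counter(observations)                            # f(a_i) for each outcome
    probs = {a: counts.get(a, 0) / m for a in range(1, 7)}    # P_i = f_i / m
    print(probs)   # {1: 0.4, 2: 0.2, 3: 0.2, 4: 0.1, 5: 0.1, 6: 0.0}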

6 In general, the task is: estimate θ, the probability vector, from X.

7 MLE. θ* = argmax_θ Pr(X; θ). With the i.i.d. (independent and identically distributed) assumption, θ* = argmax_θ Π_{i=1}^{m} Pr(x_i; θ), where θ = <P(a_1), P(a_2), ..., P(a_n)>.

8 What is known about θ: Σ_{i=1}^{n} P_i = 1 and P_i >= 0 for all i. Introducing entropy: the entropy of the distribution is H(θ) = - Σ_{i=1}^{n} P_i ln P_i.

9 Some intuition. Example with dice: outcomes = 1, 2, 3, 4, 5, 6; P(1) + P(2) + P(3) + ... + P(6) = 1; Entropy(dice) = H(θ) = - Σ_{i=1}^{6} P(i) ln P(i). Now, there is a principle known as Laplace's principle of insufficient reason (unbiased reasoning).

10 The best estimate for the dice is P(1) = P(2) = ... = P(6) = 1/6. We will now prove it assuming NO KNOWLEDGE about the dice except that it has six outcomes, each with probability >= 0 and Σ P_i = 1.

11 What does "best" mean? "Best" means most consistent with the situation: the P_i values should be such that they maximize the entropy.

12 Optimization formulation. Maximize - Σ_{i=1}^{6} P(i) log P(i) subject to: Σ_{i=1}^{6} P(i) = 1 and P(i) >= 0 for i = 1 to 6.

13 Solving the optimization (1/2). Using Lagrange multipliers, the optimization can be written as: Q = - Σ_{i=1}^{6} P(i) log P(i) - λ (Σ_{i=1}^{6} P(i) - 1) - Σ_{i=1}^{6} β_i P(i). For now, let us ignore the last term (the non-negativity constraints); we will come to it later.

14 Solving the optimization (2/2). Differentiating Q w.r.t. P(i), we get ∂Q/∂P(i) = - log P(i) - 1 - λ. Equating to zero: log P(i) + 1 + λ = 0, so log P(i) = -(1 + λ), i.e. P(i) = e^{-(1+λ)}. Since the right-hand side does not depend on i, to maximize entropy every P(i) must be equal.
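As a numerical cross-check (an added illustration, not from the talk), the same constrained optimization can be handed to a generic solver, which lands on the uniform distribution; scipy is assumed to be available:

    import numpy as np
    from scipy.optimize import minimize

    # Maximize entropy of a 6-outcome distribution, i.e. minimize sum p*log(p),
    # subject to sum(p) = 1, with p >= 0 handled via bounds.
    neg_entropy = lambda p: np.sum(p * np.log(p + 1e-12))
    constraint = {"type": "eq", "fun": lambda p: np.sum(p) - 1.0}
    p0 = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])            # arbitrary starting point
    res = minimize(neg_entropy, p0, bounds=[(0, 1)] * 6, constraints=[constraint])
    print(np.round(res.x, 3))                                 # ~[0.167 0.167 0.167 0.167 0.167 0.167]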

15 This shows that P(1) = P(2) = ... = P(6). But P(1) + P(2) + ... + P(6) = 1. Therefore P(1) = P(2) = ... = P(6) = 1/6.

16 Introducing data into the notion of entropy. Now, we introduce data: X: x_1, x_2, x_3, ..., x_m (m observations); A: a_1, a_2, a_3, ..., a_n (n outcomes). E.g. for a coin, in the absence of data, P(H) = P(T) = 1/2 (as shown in the previous proof). However, suppose data X is observed as follows. Obs-1: H H T H T T H H H T (m = 10, n = 2), giving P(H) = 6/10, P(T) = 4/10. Obs-2: T T H T H T H T T T (m = 10, n = 2), giving P(H) = 3/10, P(T) = 7/10. Which of these is a valid estimate?

17 Change in entropy. Entropy reduces as data is observed! (Slide figure: a scale of entropy values with E_max for the uniform distribution at the top, and E1 for P(H) = 6/10 and E2 for P(H) = 3/10 lying below it.)
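A quick numeric illustration (not from the talk) of this entropy drop for the coin example:

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # treat 0 * log(0) as 0
        return -np.sum(p * np.log(p))

    print(entropy([0.5, 0.5]))            # ~0.693 = E_max (uniform coin)
    print(entropy([0.6, 0.4]))            # ~0.673 = E1 (Obs-1)
    print(entropy([0.3, 0.7]))            # ~0.611 = E2 (Obs-2)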

18 Start of duality. Maximizing entropy in this situation is the same as minimizing the 'entropy reduction' distance, i.e. minimizing "relative entropy". (Slide figure: the gap between E_max, the maximum entropy of the uniform distribution, and E_data, the entropy when observations are made, is the entropy reduction.)

19 Concluding remarks. Thus, in this discussion of ML-ME duality, we will show that MLE minimizes the relative-entropy distance from the uniform distribution. Question: the entropies correspond to probability vectors, and the distance between vectors could instead be measured by squared (Euclidean) distance. Why relative entropy?

20 Maximum Likelihood-Maximum Entropy Duality: Session 2. Pushpak Bhattacharyya. Scribed by Aditya Joshi. Presented in NLP-AI talk on 28th January, 2015.

21 Recap (1/2). Whatever we can show with MLE, we can (in most cases) show through ME as well. Laplace's principle of insufficient reason (unbiased reasoning): make only the most limited assumptions. We assume data X: x_1, x_2, ..., x_m, where m is the number of observations and every x_i is an outcome of a random variable.

22 Recap (2/2). E.g. x_i ∈ {1, 2, ..., 6} for dice. If there is no data, then by Laplace's principle the uniform distribution is the best estimate of the probability of each outcome.

23 NLP perspective. Consider the probability of word mappings between parallel sentences, e.g. the mapping between "blue" and "neela" versus "blue" and "haraa" (a wrong translation). How can we estimate the probabilities of the correct mappings from the corpus?

24 Starting off. This week, we assume that we have data X: x_1, ..., x_m. The outcomes of the random variable are a_1, a_2, a_3, ..., a_n, such that each a_j occurs f_j times in the data. In the absence of any data, P_j = 1/n. What happens when we have data?

25 In the presence of data. When we have the data, P_j = ? Answer: P_j = f_j / m. WHY? Can we arrive at this value through (a) MLE, or (b) ME?

26 Taking the MLE route. MLE: we maximize the data likelihood P(X; θ), where θ = <P_1, P_2, ..., P_n>. Under i.i.d., P(X; θ) = Π_{i=1}^{m} P(x_i; θ) = Π_{j=1}^{n} P_j^{f_j}, where Σ_{j=1}^{n} f_j = m. E.g. for the observations 1 1 2 3 4 5 4, the likelihood is P_1 · P_1 · P_2 · P_3 · P_4 · P_5 · P_4 = P_1^2 P_2^1 P_3^1 P_4^2 P_5^1.

27 Maximization. MLE demands: maximize P(X; θ) subject to Σ_{j=1}^{n} P_j = 1 and P_j >= 0 for all j. Equivalently: maximize ln P(X; θ) = Σ_{j=1}^{n} f_j ln P_j subject to the same constraints.

28 Evaluating the parameter P_j (1/2). Q = Σ_{j=1}^{n} f_j ln P_j - λ (Σ_{j=1}^{n} P_j - 1). Taking the derivative w.r.t. P_j: ∂Q/∂P_j = f_j / P_j - λ. Equating to zero: f_j / P_j - λ = 0, so P_j = f_j / λ ... (1). Taking the derivative w.r.t. λ: ∂Q/∂λ = -(Σ_{j=1}^{n} P_j - 1). Equating to zero: Σ_{j=1}^{n} P_j = 1 ... (2).

29 Evaluating the parameter P_j (2/2). From (1) and (2): Σ_{j=1}^{n} f_j / λ = 1, so λ = Σ_{j=1}^{n} f_j = m, and hence P_j = f_j / m. Proved!
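As a sanity check (an added illustration, not from the talk), the log-likelihood Σ f_j ln P_j is indeed larger at P_j = f_j/m than at, say, the uniform distribution, for the dice counts used in Session 1:

    import numpy as np

    freqs = np.array([4, 2, 2, 1, 1, 0], dtype=float)   # dice counts from Session 1
    m = freqs.sum()

    def log_likelihood(p):
        mask = freqs > 0                  # unobserved outcomes contribute nothing
        return np.sum(freqs[mask] * np.log(np.asarray(p)[mask]))

    print(log_likelihood(freqs / m))          # ~ -14.71 (the MLE)
    print(log_likelihood(np.full(6, 1 / 6)))  # ~ -17.92 (uniform, strictly smaller)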

30 Summary. 1) No observation: P_j = 1/n — by maximum entropy (as seen last time). 2) Observation: P_j = f_j/m, with Σ f_j = m — by MLE (as shown today). What does ME give in the presence of data?

31 Does entropy change? Before we move on to the discussion of ME in the case of observed data, we first see how entropy is affected in the presence of data. In the context of the cases on the previous slide, we wish to see: is it true that Entropy(Case 2) <= Entropy(Case 1)? Let's verify.

32 Situation 1: no observed data. Entropy(Case 1) = - Σ_{j=1}^{n} P(j) log P(j) = - Σ_{j=1}^{n} (1/n) log (1/n) = - log (1/n) = log n.

33 Situation 2: observed data (1/2). Entropy(Case 2) = - Σ_{j=1}^{n} P(j) log P(j) = - Σ_{j=1}^{n} (f_j/m) log (f_j/m) = -(1/m) Σ_{j=1}^{n} f_j log (f_j/m) = -(1/m) Σ_{j=1}^{n} f_j [log f_j - log m].

34 Situation 2: observed data (2/2). Entropy(Case 2) = -(1/m) [Σ_{j=1}^{n} f_j log f_j - Σ_{j=1}^{n} f_j log m] = -(1/m) [Σ_{j=1}^{n} f_j log f_j - m log m] = [m log m - Σ_{j=1}^{n} f_j log f_j] / m.

35 Comparing the entropies. To show Entropy(Case 2) <= Entropy(Case 1) = log n, we must show: m log m - m log n <= Σ_{j=1}^{n} f_j log f_j, i.e. m log(m/n) <= Σ_{j=1}^{n} f_j log f_j. We will show this in the next class.
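A quick numeric spot-check of this inequality (an added illustration, using the dice counts from Session 1 and natural logs):

    import numpy as np

    f = np.array([4, 2, 2, 1, 1, 0], dtype=float)    # frequencies; m = 10, n = 6
    m, n = f.sum(), len(f)
    lhs = m * np.log(m / n)                           # m log(m/n) ~ 5.11
    rhs = np.sum(f[f > 0] * np.log(f[f > 0]))         # sum f_j log f_j ~ 8.32
    print(lhs <= rhs)                                  # True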

36 Change in entropy. Entropy reduces as data is observed! (Slide figure: E1 for the no-data case P_j = 1/n, E2 for the data case P_j = f_j/m, with E2 below E1.)

37 Concluding remark. If we can show that MLE gives the same result as maximum entropy, then the MLE-ME duality is shown.

38 Maximum Likelihood-Maximum Entropy Duality: Session 3. Pushpak Bhattacharyya. Scribed by Aditya Joshi. Presented in NLP-AI talk on 11th February, 2015.

39 Recap. X_1, X_2, X_3, ..., X_N: data. A_1, A_2, A_3, ..., A_M: outcomes. Goal: estimate P(A_i) = P_i. In the absence of any other information (not even data), P_i = 1/M; this was obtained using ME: max - Σ_{i=1}^{M} P_i log P_i s.t. Σ P_i = 1. When data is observed, P_i = f_i/N where f_i = freq(A_i); this was obtained by maximizing the log-likelihood Σ_{i=1}^{M} f_i log P_i (MLE) s.t. Σ P_i = 1 and Σ f_i = N.

40 Change in entropy. Entropy changes as data is observed. Today, we show that the entropy is "reduced", i.e. E_u > E_d. (Slide figure: E_u for the no-data case P_i = 1/M, E_d for the data case P_i = f_i/N, with E_d below E_u.)

41 Today's goal. We will show that the entropy is "reduced", i.e. E_u > E_d. We give two proofs.

42 Proof 1 (1/4). Suppose, without loss of generality, that P_1 → P + ε and P_2 → P - ε, where P = 1/M. Thus, in the case of the uniform distribution, P_1 = P_2 = ... = P = 1/M and ε = 0. Then E_u = - Σ_{i=1}^{M} P_i log P_i (with all P_i = P), and E_d = -(P+ε) log(P+ε) - (P-ε) log(P-ε) - Σ_{i=3}^{M} P_i log P_i.

43 Proof 1 (2/4). E_u - E_d = -P log P + (P+ε) log(P+ε) - P log P + (P-ε) log(P-ε) = P log((P+ε)/P) + ε (log P + log(1+ε/P)) + P log((P-ε)/P) - ε (log P + log(1-ε/P)) = P [log(1+ε/P) + log(1-ε/P)] + ε [log(1+ε/P) - log(1-ε/P)].

44 Proof 1 (3/4). E_u - E_d = P [log(1+ε/P) + log(1-ε/P)] + ε [log(1+ε/P) - log(1-ε/P)] = P log(1 - ε²/P²) + ε log((1+ε/P)/(1-ε/P)) = P log(1 - y²) + ε log((1+y)/(1-y)), where y = ε/P.

45 Proof 1 (4/4). Using the series log(1+x) = x - x²/2 + x³/3 - x⁴/4 + ... and log((1+x)/(1-x)) = 2x + 2x³/3 + 2x⁵/5 + 2x⁷/7 + ...: E_u - E_d = P log(1 - y²) + ε log((1+y)/(1-y)) = P [-y² - y⁴/2 - y⁶/3 - y⁸/4 - ...] + ε [2y + 2y³/3 + 2y⁵/5 + 2y⁷/7 + ...]. Substituting ε = Py: = -Py² [1 + y²/2 + y⁴/3 + y⁶/4 + ...] + Py² [2 + 2y²/3 + 2y⁴/5 + 2y⁶/7 + ...]. Comparing the two series term by term (2 > 1, 2/3 > 1/2, 2/5 > 1/3, etc.), the positive series dominates, so the value is > 0, i.e. E_u - E_d > 0, i.e. E_u > E_d.
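A numeric check of this perturbation argument (an added illustration, not from the talk):

    import numpy as np

    M, eps = 6, 0.05
    uniform = np.full(M, 1 / M)
    perturbed = uniform.copy()
    perturbed[0] += eps                               # P1 -> P + eps
    perturbed[1] -= eps                               # P2 -> P - eps

    H = lambda p: -np.sum(p * np.log(p))
    print(H(uniform) - H(perturbed))                  # > 0, as the series expansion predicts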

46 Proof 2 (1/2). Identity: for x, y > 0, y - y log y <= x - y log x. Proof: log t <= t - 1. Put t = x/y: log(x/y) <= x/y - 1, i.e. log x - log y <= (x - y)/y, i.e. y log x - y log y <= x - y, i.e. y - y log y <= x - y log x.

47 Proof 2 (2/2). Let p_1, p_2, ..., p_M = 1/M (uniform) and let q_1, q_2, ..., q_M be the perturbed distribution. Putting x = p_i, y = q_i in the identity y - y log y <= x - y log x gives q_i - q_i log q_i <= p_i - q_i log p_i. Taking the sum over i = 1, ..., M: Σ q_i - Σ q_i log q_i <= Σ p_i - Σ q_i log p_i. But Σ q_i = 1 and Σ p_i = 1, and - Σ q_i log p_i = log M · Σ q_i = log M, so 1 - Σ q_i log q_i <= 1 + log M, i.e. 1 + E_d <= 1 + E_u, hence E_u >= E_d.
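The identity behind Proof 2 can also be spot-checked numerically (an added illustration, not from the talk):

    import numpy as np

    rng = np.random.default_rng(42)
    x, y = rng.uniform(0.01, 5.0, size=2)             # any positive x, y
    print(y - y * np.log(y) <= x - y * np.log(x))     # True, with equality only at x = y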

48 Conclusion We showed today that there is a “reduction” in entropy when data is observed. We showed two proofs: the first is more intuitive but long; the second is simpler.

49 Maximum Likelihood-Maximum Entropy Duality: Session 4. Pushpak Bhattacharyya. Scribed by Aditya Joshi. Presented in NLP-AI talk on 25th March, 2015.

50 A uniform distribution and any p.d. P. (Slide figure: two vectors, the uniform distribution 1/n and the distribution P.) Let us now compute the relative entropy / KLD between these two vectors.

51 Relative entropy between two non-negative vectors P: <p_1, ..., p_n> and Q: <q_1, ..., q_n>. The KL divergence or relative entropy between the two vectors is defined as D(P||Q) = Σ_{i=1}^{n} p_i log(p_i/q_i) - Σ_{i=1}^{n} p_i + Σ_{i=1}^{n} q_i, with D(P||Q) >= 0, and D(P||Q) = 0 occurs when p_i = q_i for all i.

52 Relative entropy between the uniform distribution and any p.d. P. D(P || 1/n) = Σ_{i=1}^{n} p_i log(p_i/(1/n)) - Σ_{i=1}^{n} p_i + Σ_{i=1}^{n} (1/n) = Σ_{i=1}^{n} p_i log p_i + Σ_{i=1}^{n} p_i log n - 1 + 1 = - E(P) + log n, where E(P) is the entropy of P. So D(P || 1/n) = - E(P) + log n.
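A numeric check of this identity for an arbitrary distribution (an added illustration, not from the talk):

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(6))                     # a random probability vector
    u = np.full(6, 1 / 6)

    kld = np.sum(p * np.log(p / u))                   # D(P || 1/n)
    E_p = -np.sum(p * np.log(p))                      # entropy E(P)
    print(np.isclose(kld, -E_p + np.log(6)))          # True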

53 Relation between the entropy of a p.d. and its relative entropy with the uniform distribution. Since D(P || 1/n) = - E(P) + log n, we have argmin_P D(P || 1/n) = argmax_P E(P). This shows that if we maximize the entropy of P, we minimize its relative entropy with respect to the uniform distribution. Conclusion: the probability distribution found by maximizing entropy is the distribution with the least KLD (relative entropy) from the uniform distribution.

54 Digression: what constitutes a valid proof? Given a statement S and a path P, the Euclidean-distance path may not lead us to proving statement S; how S can be proven may depend on the path P and the choice of axioms.

55 A uniform distribution and any p.d. with data. (Slide figure: two vectors, the uniform distribution 1/n and the frequency vector F: f_1, f_2, ..., f_n.) Now we have data (n-ary) given by the frequencies (f_1, f_2, ..., f_n) as shown above. Let us compute the relative entropy / KLD between these two vectors.

56 Relative entropy between the uniform distribution and the relative-frequency distribution RF (with RF_i = f_i/m). D(RF || 1/n) = Σ_{i=1}^{n} (f_i/m) log((f_i/m)/(1/n)) = log n + (1/m) Σ_{i=1}^{n} f_i log(f_i/m) = log n + (1/m) log Π_j P_j^{f_j}, where P_j = f_j/m.

57 Relation between RF and uniform We find that relative entropy D(RF||1/n) = log n + 1/m [log-likelihood]

58 Frequency distribution F and a p.d. P. Suppose F gives the frequencies of the outcomes: F: f_1, f_2, ..., f_n, with m the total number of observations. Let P be "some" probability distribution. Let us now find the KLD between F and P.

59 Relative entropy between the frequency distribution and any p.d. P. D(F||P) = Σ_{i=1}^{n} f_i log(f_i/p_i) - Σ_{i=1}^{n} f_i + Σ_{i=1}^{n} p_i = Σ_{i=1}^{n} f_i log f_i - Σ_{i=1}^{n} f_i log p_i - m + 1 = constant - log-likelihood, where the constant [Σ f_i log f_i - m + 1] does not depend on P. So D(F||P) = constant - log-likelihood.
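A numeric check (an added illustration, not from the talk) that D(F||P) and the log-likelihood differ only by a constant that does not depend on P:

    import numpy as np

    f = np.array([4, 2, 2, 1, 1, 0], dtype=float)     # the dice counts from Session 1; m = 10
    m = f.sum()
    mask = f > 0                                       # 0 * log(0) terms are dropped

    def gen_kld(p):                                    # D(F||P) for non-negative vectors
        return np.sum(f[mask] * np.log(f[mask] / p[mask])) - m + np.sum(p)

    def log_likelihood(p):
        return np.sum(f[mask] * np.log(p[mask]))

    for p in (np.full(6, 1 / 6), f / m):
        print(gen_kld(p) + log_likelihood(p))          # same constant both times (~ -0.68)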

60 Relation between the likelihood of P and its relative entropy with the frequency distribution. Since D(F||P) = constant - log-likelihood, argmin_P D(F||P) = argmax_P LL(P). This shows that if we maximize the log-likelihood of P, we minimize its relative entropy with respect to the frequency distribution. Conclusion: the probability distribution found by maximizing log-likelihood is the distribution with the least KLD (relative entropy) from the frequency distribution.

61 Concluding remarks. Observations today: 1) The P found by maximizing entropy is the one with the least RE w.r.t. the uniform distribution. 2) The P found by maximizing likelihood is the one with the least RE w.r.t. the frequency distribution. Work to be done: A) bring the frequencies f into (1) above; B) bring the uniform distribution 1/n into (2) above.

62 Maximum Likelihood-Maximum Entropy Duality: Session 5. Pushpak Bhattacharyya. Scribed by Aditya Joshi. Presented in NLP-AI talk on 17th June, 2015.

63 Recap. Some notation has been corrected in this presentation. Setting: X: X_1, X_2, ..., X_m (data); A: A_1, A_2, ..., A_n (outcomes); P: P_1, P_2, ..., P_n (parameters to be estimated). For example: X: 2 1 3 5 6 3 4; A: 1 2 3 4 5 6; P: P_1, P_2, ..., P_6.

64 Recap: goal. To estimate the parameters P, there are two paths: a) maximum likelihood: P_i = f_i/m, as shown; b) maximum entropy: P_i = 1/n (data not considered). We also showed that entropy is maximum for the uniform distribution P_i = 1/n for all i: with H = - Σ_{i=1}^{n} P_i log P_i, H_u = entropy of the uniform distribution = - Σ_{i=1}^{n} (1/n) log(1/n) = - log(1/n) = log n.

65 Recap: reduction in entropy. When we have data, we have shown in two ways that H_u > H_d, where H_u is the entropy of the uniform distribution and H_d the entropy in the case of data. I.e., if we perturb a distribution 1/n, 1/n, ... to 1/n + k, 1/n - k, 1/n, ..., the entropy decreases.

66 Recap: relative entropy. P: p_1, p_2, p_3, ..., p_n and Q: q_1, q_2, q_3, ..., q_n, with Σ_{i=1}^{n} p_i = 1 and Σ_{i=1}^{n} q_i = 1. D(P||Q) = Σ_{i=1}^{n} p_i log(p_i / q_i).

67 Recap: relative entropy (continued). For Q the uniform distribution, D(P||Q) = Σ_{i=1}^{n} p_i log(p_i / q_i) = Σ_{i=1}^{n} p_i log(p_i / (1/n)) = Σ_{i=1}^{n} p_i log p_i - log(1/n) = - H_p + log n = H_u - H_p. PIVOT: relative entropy from the uniform distribution and the absolute entropy difference are the same.

68 Approach to prove duality. Theorem: MLE and ME converge when (a) the distance measure is relative entropy, and (b) the distributions belong to the exponential family.

69 Data vs. feature matrix. (Slide figure: a matrix with data points X_1, X_2, ..., X_i, ..., X_m as columns and features F_1, F_2, ..., F_i, ... as rows; e.g. the data point "promotion" against the feature "word ends with 'tion'". Outcomes: POS tags.)

70 A note on outcomes. A: A_1, A_2, ..., A_n. Here n is tied to the classification task. For POS tagging with the Penn tagset: A: A_1, A_2, ..., A_39.

71 Exponential & constrained distributions. Exponential form: P_i = c · Π_{j=1}^{k} λ_j^{f_ji} (equivalently P_i = c · e^{Σ_j μ_j f_ji} with μ_j = log λ_j), where c is a normalizing constant. Constrained set: P = { p s.t. E_p[F] = E_p̃[F] }, i.e. the expected feature values under the intended distribution p equal the expected feature values under the distribution p̃ obtained from the data.

72 Equivalence between constrained and exponential. Now let P be the constrained family of distributions, Q the exponential family, and P* a distribution in P ∩ Q. By ME, we will show that H(P*) >= H(p) for all p ∈ P. By MLE, we will show that LL(P*) >= LL(q) for all q ∈ Q. This is done using concepts such as (a) the Pythagorean property of relative entropy, and (b) expressing LL in terms of P.

73 Maximum Likelihood-Maximum Entropy Duality: Session 6. Pushpak Bhattacharyya. Scribed by Aditya Joshi. Presented in NLP-AI talk on 2nd September, 2015.

74 Recap (1/3). O: o_1, o_2, o_3, ..., o_n (observations); A: a_1, a_2, a_3, ..., a_m (outcomes); P: p_1, p_2, p_3, ..., p_m (probability distribution); F: f_1, f_2, f_3, ..., f_m (frequencies). P̃_i = f_i/n: the empirical distribution, according to MLE. P_i = 1/m: the distribution according to ME with no data, i.e. uniform.

75 Recap (2/3). Entropy: H_u, the entropy with the uniform distribution, = log m; H_d, the entropy when there is data; H_d <= H_u.

76 Recap (3/3). For non-negative vectors P: <p_1, ..., p_m> and Q: <q_1, ..., q_m>, the KL divergence or relative entropy is defined as D(P||Q) = Σ_i p_i log(p_i/q_i) - Σ_i p_i + Σ_i q_i. D(P || Uniform) = H_u - H_p.

77 A small digression. Why is entropy in log form? E.g. suppose two dice are thrown; the sample space has 36 outcomes (1,1), (1,2), .... If I say x_i + y_i = 12 versus x_i + y_i = 7, which message has more information?

78 Which message is more informative? Sum = 12: only one outcome, (6,6), is possible, so the probability of that outcome is 1 and the uncertainty is less. Sum = 7: six outcomes, (1,6), (2,5), (3,4), (4,3), (5,2), (6,1), each with probability 1/6, so the uncertainty is more. (Slide figure: uncertainty plotted against the probability P on the range 0 to 1.)

79 Additional certainty or information. If there are two independent events with probabilities p and q, then intuitively the information should add: I(pq) = I(p) + I(q). The log function satisfies this requirement (see the sketch below). E.g. the POS uncertainty of "play" >= that of "played": less certainty → more entropy.
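A tiny numeric illustration (not from the talk) of this additivity with I(p) = -log p:

    import numpy as np

    p, q = 1 / 6, 1 / 2                 # e.g. "die shows 3" and "coin shows heads"
    print(-np.log(p * q))               # information of the joint, independent event
    print(-np.log(p) - np.log(q))       # sum of the individual informations: same value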

80 Familiar example: POS tagging. Vocabulary V: V_1, V_2, V_3, ..., V_|V|. Tag set T: T_1, T_2, T_3, ..., T_|T|. Call |V| × |T| = A. We wish to estimate P: P_1, P_2, P_3, ..., P_A, where P_i = P(V_v, T_t), e.g. P("play", "NN").

81 Dataset and features. X: (w_1, t_1), (w_2, t_2), (w_3, t_3), ..., (w_n, t_n). We introduce binary features F: F_1, F_2, F_3, ..., F_|F|. Example: F_f = 1 if V_v ends with "-tion", 0 otherwise.

82 Binding features with V × T. Example features: (1) ends with -tion, (2) has 3 syllables, etc. F is a matrix with |F| rows and |V| × |T| columns. P̃: P_1, P_2, P_3, ..., P_{|V||T|} is the empirical distribution over V × T.

83 Expected value of a feature. (Slide figure: the |F| × (|V| × |T|) feature matrix, with rows F_1, F_2, F_3, ..., F_|F| and columns indexed by V × T.)

84 Expected value of a feature (continued). E_p(F_f) = Σ_{i=1}^{|V|·|T|} Pr_i · Value_i(F_f), i.e. the expectation of the feature F_f under the distribution p. The same formula gives the expectation under the uniform distribution and under the empirical distribution p̃.

85 Introducing two distributions. Constrained distribution (w.r.t. training data): R: { r(y) : E_r(F) = E_p̃(F) }, where y ∈ V × T and p̃ is the empirical distribution. Exponential distribution: S: { s(y) : s(y) = k · e^{Σ_{f=1}^{|F|} μ_f F_f(y)} }, e.g. s("promotion", "NN"). A sketch of this exponential form follows below.
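A toy sketch of the exponential form s(y) = k · e^{Σ_f μ_f F_f(y)} over (word, tag) pairs; the two features and the weights are made up for illustration and are not from the talk:

    import numpy as np

    pairs = [("promotion", "NN"), ("promotion", "VB"), ("play", "NN"), ("play", "VB")]

    def features(word, tag):
        return np.array([
            1.0 if word.endswith("tion") else 0.0,   # F1: word ends with "-tion"
            1.0 if tag == "NN" else 0.0,             # F2: tag is NN
        ])

    mu = np.array([1.5, 0.8])                        # hypothetical feature weights
    scores = np.array([np.exp(mu @ features(w, t)) for w, t in pairs])
    s = scores / scores.sum()                        # k plays the role of the normalizer 1/Z
    for (w, t), prob in zip(pairs, s):
        print(w, t, round(prob, 3))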

86 Pythagorean theorem. D(r || s) = D(r || p*) + D(p* || s), where p* is a distribution in the intersection of R and S.

87 Conclusion: intuition. The intuition is: the constrained distribution brings in the likelihood (its constraints come from the observed data), while the exponential form arises from maximizing entropy. Hence the duality. We will continue in the next session.

88 Maximum Likelihood-Maximum Entropy Duality: Session 7. Pushpak Bhattacharyya. Scribed by Aditya Joshi. Presented in NLP-AI talk on 20th September, 2015.

89 Recap: POS tagging (1/2). Vocabulary V: V_1, V_2, V_3, ..., V_|V|. Tag set T: T_1, T_2, T_3, ..., T_|T|. Features F: F_1, F_2, F_3, ..., F_|F|. E.g. "promotion" (V_i), "NN" (T_j), has 'tion' (F_k). Goal: phenomenon → algorithm → model.

90 Recap: POS tagging (2/2). P: P_1, P_2, P_3, ..., P_{|V| × |T|}. P(T|W) = P(T, W) / P(W); hence P("promotion", "NN") is the kind of probability we wish to estimate. Observations O: O_1, O_2, O_3, ..., O_m, with each o ∈ {V} × {T}, taken from a tagged corpus.

91 Constrained distribution. R: { r | E_r(F_k) = E_p̃(F_k) }, where p̃ is the empirical distribution and k = 1, ..., |F|. E_r(F_k) = Σ over all values of F_k of V(F_k) · Pr(F_k). Since the features are binary (V(F_k) = 0 or 1), E_r(F_k) = Pr(F_k) = Σ_{x ∈ {V} × {T}} Pr(F_k, x) = Σ_x Pr(x) Pr(F_k | x).

92 Exponential distribution. S: { s(y) : s(y) = k · e^{Σ_{f=1}^{|F|} μ_f F_f(y)} }, where the μ_f are the weights of the features. The uniform distribution U is a member of the exponential family, with μ_f = 0 for all f.

93 Pythagorean theorem. Let p* be a member of S ∩ R. The theorem states that for such a p*, D(s||r) = D(s||p*) + D(p*||r), where s ∈ S and r ∈ R.

94 Proof (1/3). LHS = D(s||r) = Σ_x s(x) log s(x) - Σ_x s(x) log r(x) = - H(s) - E_s[log r(x)].

95 Proof (2/3). RHS = D(s||p*) + D(p*||r) = Σ_x s(x) log s(x) - Σ_x s(x) log p*(x) + Σ_x p*(x) log p*(x) - Σ_x p*(x) log r(x).

96 Proof (3/3). We will show later that terms 2 and 3 of the RHS cancel and that the last term of the RHS equals the last term of the LHS. The overall argument also relies on the fact that relative entropy >= 0; this too will be shown later.

97 Taking the ME path (1/2). With the Pythagorean theorem in view, we wish to show that p* = argmax_r H(r). Consider D(r||u), where u is the uniform distribution. By the theorem, D(r||u) = D(r||p*) + D(p*||u). Since D(·) >= 0, D(r||u) >= D(p*||u).

98 Taking the ME path (2/2). D(r||u) = -H(r) - Σ_x r(x) log u(x) = -H(r) + log(|V| × |T|), and D(p*||u) = -H(p*) - Σ_x p*(x) log u(x) = -H(p*) + log(|V| × |T|). Therefore -H(r) + log(|V| × |T|) >= -H(p*) + log(|V| × |T|), i.e. -H(r) >= -H(p*), i.e. H(r) <= H(p*). ............ (A)

99 Taking the MLE path (1/3). Consider D(p̃ || s) = D(p̃ || p*) + D(p* || s), where p̃ is the empirical (constrained) distribution. Then D(p̃ || s) >= D(p̃ || p*), i.e. -H(p̃) - Σ_x p̃(x) log s(x) >= -H(p̃) - Σ_x p̃(x) log p*(x), i.e. Σ_x p̃(x) log s(x) <= Σ_x p̃(x) log p*(x).

100 Taking the MLE path (2/3). Suppose p̃(x) = f_x / |O|. Multiplying the previous inequality by |O| > 0: Σ_x f(x) log s(x) <= Σ_x f(x) log p*(x), i.e. Σ_x log s(x)^{f(x)} <= Σ_x log p*(x)^{f(x)}, i.e. LL(s) <= LL(p*). ............ (B)

101 The duality! By (A) and (B), p* both maximizes entropy (from (A), over the constrained family R) and maximizes log-likelihood (from (B), over the exponential family S).

102 Pending critical subproofs. 1) p*(x) is unique. 2) Proof of the Pythagorean theorem. 3) D(p||q) >= 0. 4) U(x) is a constrained distribution. 5) Is p̃(x) = f_x/|O|?

