1. Unified Expectation Maximization. Rajhans Samdani; joint work with Ming-Wei Chang (Microsoft Research) and Dan Roth, University of Illinois at Urbana-Champaign. NAACL 2012, Montreal.

2. Weakly Supervised Learning in NLP
- Labeled data is scarce and difficult to obtain.
- There is a lot of work on learning with a small amount of labeled data; the Expectation Maximization (EM) algorithm is the de facto standard.
- More recently, there has been significant work on injecting weak supervision or domain knowledge via constraints into EM:
  - Constraint-Driven Learning (CoDL; Chang et al., 2007)
  - Posterior Regularization (PR; Ganchev et al., 2010)

3. Weakly Supervised Learning: EM and ...?
- Several variants of EM exist in the literature: hard EM, and variants of constrained EM (CoDL and PR).
- Which version should we use: EM (PR) or hard EM (CoDL)? Or is there something better out there?
- Our contribution: a unified framework for EM algorithms, Unified EM (UEM), which
  - includes existing EM algorithms,
  - picks the most suitable EM algorithm in a simple, adaptive, and principled way,
  - adapts to data, initialization, and constraints.

4. Outline
- Background: Expectation Maximization (EM), and EM with constraints
- Unified Expectation Maximization (UEM)
- Optimization algorithm for the E-step
- Experiments

5. Predicting Structures in NLP
- Predict the output or dependent variable y from the space of allowed outputs Y, given input variable x, using parameters or weight vector w. For example:
  - predict POS tags given a sentence,
  - predict word alignments given sentences in two different languages,
  - predict the entity-relation structure of a document.
- Prediction is expressed as y* = argmax_{y ∈ Y} P(y | x; w).

6. Learning Using EM: A Quick Primer
- Given unlabeled data x, estimate w; the output y is hidden.
- for t = 1 ... T do:
  - E-step: estimate a posterior distribution q over y: q_t(y) = P(y | x; w_t), the conditional distribution of y given w_t; equivalently, q_t = argmin_q KL(q(y), P(y | x; w_t)) (Neal and Hinton, 99).
  - M-step: estimate the parameters w with respect to q: w_{t+1} = argmax_w E_q[ log P(x, y; w) ].
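As a minimal sketch of this loop (in Python, not the authors' code): `posterior` and `fit_params` are hypothetical model-specific routines standing in for the E-step and M-step computations.

    # Minimal sketch of the EM loop above; `posterior` and `fit_params` are
    # hypothetical model-specific routines (not from this work):
    # posterior(x, w) returns q_t(y) = P(y | x; w_t) over candidate outputs,
    # fit_params(x, q) solves the M-step argmax_w E_q[ log P(x, y; w) ].
    def em(x, w_init, posterior, fit_params, T=50):
        w, q = w_init, None
        for t in range(T):
            q = posterior(x, w)      # E-step
            w = fit_params(x, q)     # M-step
        return w, q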

7. Another Version of EM: Hard EM
- Standard EM. E-step: q_t = argmin_q KL(q(y), P(y | x; w_t)). M-step: argmax_w E_q[ log P(x, y; w) ].
- Hard EM. E-step: q(y) = δ(y = y*), where y* = argmax_y P(y | x; w_t). M-step: argmax_w E_q[ log P(x, y; w) ].
- It is not clear which version to use!

8. Constrained EM
- Domain knowledge-based constraints can help a lot by guiding unsupervised learning:
  - Constraint-Driven Learning (Chang et al., 2007)
  - Posterior Regularization (Ganchev et al., 2010)
  - Generalized Expectation Criterion (Mann and McCallum, 2008)
  - Learning from Measurements (Liang et al., 2009)
- Constraints are imposed on y (a structured object, {y_1, y_2, ..., y_n}) to specify/restrict the set of allowed structures Y.

9. Entity-Relation Prediction: Type Constraints
- Predict entity types: Per, Loc, Org, etc.
- Predict relation types: lives-in, org-based-in, works-for, etc.
- Entity-relation type constraints tie the two together: a relation type restricts the allowed types of its argument entities.
- Example: "Dole's wife, Elizabeth, is a resident of N.C.", with entities E1 (Dole), E2 (Elizabeth), E3 (N.C.) and relations R12, R23; in the figure, R23 = lives-in, E2 = Per, and E3 = Loc.

10. Bilingual Word Alignment: Agreement Constraints
- Align words from sentences in EN with sentences in FR.
- Agreement constraints: the EN-FR alignment should agree with the FR-EN alignment (Ganchev et al., 2010).
- [Figure: example bidirectional word alignment; picture courtesy of Lacoste-Julien et al., 2010]

11. Structured Prediction Constraints: Representation
- Assume a set of linear constraints: Y = { y : Uy ≤ b }, a universal representation (Roth and Yih, 2007).
- These can be relaxed into expectation constraints on posterior probabilities: E_q[Uy] ≤ b.
- We focus on introducing constraints during the E-step; a toy illustration follows below.
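As a toy Python/NumPy illustration of the Uy ≤ b representation and its expectation relaxation: the indicator encoding and the particular constraint below are assumptions invented for this example, not the paper's exact formulation.

    import numpy as np

    # Toy illustration of the linear-constraint representation Y = {y : Uy <= b}
    # and its expectation relaxation E_q[Uy] <= b. The encoding is an assumption
    # made for illustration: y is a binary indicator vector where positions 0-2
    # one-hot encode the type of entity E3 (Per, Loc, Org) and position 3
    # indicates "R23 = lives-in".

    # Constraint: if R23 = lives-in then E3 must be Loc, i.e. y[3] - y[1] <= 0.
    U = np.array([[0.0, -1.0, 0.0, 1.0]])
    b = np.array([0.0])

    def satisfies(y):
        return bool(np.all(U @ y <= b))

    def expected_constraint_value(q, ys):
        # E_q[Uy] for a distribution q over candidate structures ys.
        return sum(qi * (U @ yi) for qi, yi in zip(q, ys))

    y_good = np.array([0, 1, 0, 1])   # E3 = Loc and R23 = lives-in: allowed
    y_bad  = np.array([1, 0, 0, 1])   # E3 = Per and R23 = lives-in: violates the constraint
    print(satisfies(y_good), satisfies(y_bad))                      # True False
    print(expected_constraint_value([0.5, 0.5], [y_good, y_bad]))   # [0.5] > b: violated in expectation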

12. Two Versions of Constrained EM
- Posterior Regularization (Ganchev et al., 2010). E-step: q_t = argmin_q KL(q(y), P(y | x; w_t)) subject to E_q[Uy] ≤ b. M-step: argmax_w E_q[ log P(x, y; w) ].
- Constraint-Driven Learning (Chang et al., 2007). E-step: q(y) = δ(y = y*), where y* = argmax_y P(y | x; w_t) subject to Uy ≤ b. M-step: argmax_w E_q[ log P(x, y; w) ].
- Again, it is not clear which version to use!

13. So How Do We Learn?
- EM (PR) vs. hard EM (CoDL): it is unclear which version of EM to use (Spitkovsky et al., 2010). This is the starting point of our research.
- We present a family of EM algorithms that includes these algorithms, along with infinitely many new ones: Unified Expectation Maximization (UEM).
- UEM lets us pick the best EM algorithm in a principled way.

14. Outline
- Notation and Expectation Maximization (EM)
- Unified Expectation Maximization: motivation, formulation, and mathematical intuition
- Optimization algorithm for the E-step
- Experiments

15. Motivation: Unified Expectation Maximization (UEM)
- EM (PR) and hard EM (CoDL) differ mostly in the entropy of the posterior distribution.
- UEM tunes the entropy of the posterior distribution q and is parameterized by a single parameter γ.
- [Figure: a continuum of algorithms ranging from hard EM to EM]

16. Unified EM (UEM)
- EM (PR) minimizes the KL divergence KL(q, P(y | x; w)), where KL(q, p) = Σ_y [ q(y) log q(y) − q(y) log p(y) ].
- UEM changes the E-step of standard EM and minimizes a modified KL divergence KL(q, P(y | x; w); γ), where KL(q, p; γ) = Σ_y [ γ q(y) log q(y) − q(y) log p(y) ].
- Different γ values give different EM algorithms; γ changes the entropy of the posterior.
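A short supporting derivation, not shown on the slide but consistent with the special cases discussed next: minimizing the modified divergence over the probability simplex (without constraints) gives

    minimize over q:   Σ_y [ γ q(y) log q(y) − q(y) log p(y) ]    subject to   Σ_y q(y) = 1, q(y) ≥ 0.

    With a Lagrange multiplier μ for the normalization constraint and γ > 0, stationarity gives
    γ (log q(y) + 1) − log p(y) + μ = 0,   hence   q(y) ∝ p(y)^(1/γ).

So γ = 1 recovers q = p (standard EM), and as γ → 0 the posterior concentrates on argmax_y p(y) (hard EM).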

17. Effect of Changing γ
- KL(q, p; γ) = Σ_y [ γ q(y) log q(y) − q(y) log p(y) ]
- [Figure: an original distribution p and the posteriors q obtained for several values of γ; γ = 1 reproduces p, while γ = 0 concentrates q on the mode of p]
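A small Python sketch of this unconstrained E-step, using the closed form q(y) ∝ p(y)^(1/γ) derived above; it works over an explicitly enumerated output space and is an illustration, not the authors' implementation.

    import numpy as np

    # Unconstrained UEM E-step: minimize KL(q, p; gamma) over the simplex.
    # For gamma > 0 the minimizer is q(y) proportional to p(y)^(1/gamma);
    # gamma = 1 gives q = p (standard EM), and as gamma -> 0 the posterior
    # becomes a point mass on argmax_y p(y) (hard EM).
    def uem_posterior(p, gamma):
        p = np.asarray(p, dtype=float)
        if gamma <= 1e-8:                   # hard E-step: MAP assignment
            q = np.zeros_like(p)
            q[np.argmax(p)] = 1.0
            return q
        logq = np.log(p + 1e-300) / gamma   # q proportional to p^(1/gamma)
        logq -= logq.max()                  # numerical stability
        q = np.exp(logq)
        return q / q.sum()

    p = np.array([0.6, 0.3, 0.1])
    for g in [1.0, 0.5, 0.1, 0.0]:
        print(g, np.round(uem_posterior(p, g), 3))
    # gamma = 1 reproduces p; smaller gamma concentrates mass on the mode.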

18. Unifying Existing EM Algorithms
- Changing the value of γ in KL(q, p; γ) = Σ_y [ γ q(y) log q(y) − q(y) log p(y) ] recovers different existing EM algorithms:
  - Without constraints: the hard end of the range gives hard EM, γ = 1 gives standard EM, and intermediate values correspond to deterministic annealing (Smith and Eisner, 2004; Hofmann, 1999).
  - With constraints: the hard end of the range corresponds to CoDL and γ = 1 gives PR.

19. Range of γ
- We focus on tuning γ in the range [0, 1], which contains infinitely many new EM algorithms.
- At γ = 0: hard EM (without constraints) and an LP approximation to CoDL (new, with constraints). At γ = 1: EM (without constraints) and PR (with constraints).

20. Tuning γ in Practice
- γ essentially tunes the entropy of the posterior to better adapt to the data, initialization, constraints, etc.
- We tune γ using a small amount of development data over the range [0, 1], as sketched below.
- UEM for an arbitrary γ in this range is very easy to implement: existing EM/PR/hard EM/CoDL code can easily be extended to implement UEM.
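A sketch of this tuning protocol; `train_uem` and `evaluate` are hypothetical stand-ins for a model-specific trainer and a development-set metric (assumed here to be higher-is-better), not functions from the paper.

    # Grid-search gamma in [0, 1] on a small development set.
    def tune_gamma(train_data, dev_data, train_uem, evaluate, grid=None):
        grid = grid if grid is not None else [i / 10 for i in range(11)]
        best_gamma, best_score, best_model = None, float("-inf"), None
        for gamma in grid:
            model = train_uem(train_data, gamma)   # run UEM with this gamma
            score = evaluate(model, dev_data)      # dev-set metric, higher is better
            if score > best_score:
                best_gamma, best_score, best_model = gamma, score, model
        return best_gamma, best_model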

21. Outline
- Setting up the problem
- Unified Expectation Maximization
- Solving the constrained E-step: a Lagrange dual-based algorithm; unification of existing algorithms
- Experiments

22. The Constrained E-step
- Minimize over q the γ-parameterized KL divergence KL(q, P(y | x; w); γ),
- subject to the domain knowledge-based linear constraints E_q[Uy] ≤ b,
- and the standard probability simplex constraints Σ_y q(y) = 1, q(y) ≥ 0.
- For γ ≥ 0 the problem is convex.

23. Solving the Constrained E-step for q(y)
- Step 1: Introduce a dual variable λ for each constraint.
- Step 2: Perform subgradient ascent on the dual variables, with ∇_λ ∝ E_q[Uy] − b.
- Step 3: Compute q for the given λ: for γ > 0, q has a closed form; as γ → 0, this becomes unconstrained MAP inference.
- Iterate steps 2 and 3 until convergence. A sketch follows below.
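A Python sketch of this procedure over a small, explicitly enumerated output space. The closed form q(y) ∝ P(y | x; w)^(1/γ) · exp(−λᵀUy / γ) used for γ > 0 follows from the Lagrangian of the E-step on slide 22; the explicit enumeration, fixed step size, and fixed iteration count are simplifying assumptions, not the authors' implementation (which exploits model structure, e.g. forward-backward in an HMM).

    import numpy as np

    def constrained_e_step(p, Y, U, b, gamma, step=0.1, iters=200):
        # p: model probabilities P(y | x; w) for each candidate structure, shape (K,)
        # Y: candidate structures as indicator columns, shape (d, K)
        # U, b: linear constraints Uy <= b, shapes (m, d) and (m,)
        p = np.asarray(p, dtype=float)
        lam = np.zeros(U.shape[0])                       # one dual variable per constraint
        q = p / p.sum()
        for _ in range(iters):
            scores = np.log(p + 1e-300) - (lam @ U) @ Y  # log p(y) - lambda^T U y
            if gamma > 1e-8:                             # gamma > 0: closed-form posterior
                logq = scores / gamma
                q = np.exp(logq - logq.max())
                q /= q.sum()
            else:                                        # gamma -> 0: unconstrained MAP inference
                q = np.zeros_like(p)
                q[np.argmax(scores)] = 1.0
            grad = U @ Y @ q - b                         # subgradient: E_q[Uy] - b
            lam = np.maximum(0.0, lam + step * grad)     # projected ascent keeps lambda >= 0
        return q, lam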

24. Some Properties of our E-step Optimization
- We use a dual projected subgradient ascent algorithm (Bertsekas, 99), which handles inequality constraints.
- For special instances where two (or more) "easy" problems are connected via constraints, it reduces to dual decomposition:
  - For γ > 0: convex dual decomposition over individual models (e.g., HMMs) connected via dual variables; γ = 1 gives the dual decomposition used in posterior regularization (Ganchev et al., 2008).
  - For γ = 0: Lagrangian relaxation/dual decomposition for hard ILP inference (Koo et al., 2010; Rush et al., 2011).

25. Outline
- Setting up the problem
- Introduction to Unified Expectation Maximization
- Lagrange dual-based optimization algorithm for the E-step
- Experiments: POS tagging, entity-relation extraction, word alignment

26. Experiments: Exploring the Role of γ
- Test whether tuning γ improves performance over the baselines.
- Study the relation between the quality of the initialization and γ (i.e., the "hardness" of inference).
- Compare against:
  - Posterior Regularization (PR), corresponding to γ = 1.0
  - Constraint-Driven Learning (CoDL), corresponding to γ → −∞

27. Unsupervised POS Tagging
- Model POS tagging as a first-order HMM.
- Try initializations of varying quality:
  - Uniform initialization: initialize all states with equal probability.
  - Supervised initialization: initialize with parameters trained on varying amounts of labeled data.
- Test the "conventional wisdom" that hard EM does well with a good initialization while EM does better with a weak initialization.

28. Unsupervised POS Tagging: Different EM Instantiations
- [Plot: performance relative to EM as a function of γ (hard EM at one end, EM at the other), for uniform initialization and for supervised initialization with 5, 10, 20, and more labeled examples]

29. Experiments: Entity-Relation Extraction
- Extract entity types (e.g., Loc, Org, Per) and relation types between pairs of entities (e.g., Lives-in, Org-based-in, Killed).
- Add constraints:
  - type constraints between entities and relations,
  - expected-count constraints to regularize the count of the 'None' relation.
- Semi-supervised learning with a small amount of labeled data.
- Running example: "Dole's wife, Elizabeth, is a resident of N.C.", with entities E1, E2, E3 and relations R12, R23.

30. Results on Relations
- [Plot: macro-F1 scores on relation extraction vs. the percentage of labeled data]

31. Experiments: Word Alignment
- Word alignment from a source language S to a target language T; we try the En-Fr and En-Es pairs.
- We use an HMM-based model with agreement constraints for word alignment.
- PR with agreement constraints is known to give large improvements over the plain HMM (Ganchev et al., 2008; Graca et al., 2008).
- We use our efficient algorithm to decompose the E-step into individual HMMs.

32. Word Alignment: EN-FR with 10k Unlabeled Data
- [Plot: alignment error rate]

33. Word Alignment: EN-FR
- [Plot: alignment error rate]

34. Word Alignment: FR-EN
- [Plot: alignment error rate]

35. Word Alignment: EN-ES
- [Plot: alignment error rate]

36. Word Alignment: ES-EN
- [Plot: alignment error rate]

37. Summary of Experiments
- In different settings, different baselines work better:
  - Entity-relation extraction: CoDL does better than PR.
  - Word alignment: PR does better than CoDL.
  - Unsupervised POS tagging: it depends on the initialization.
- UEM allows us to choose the best algorithm in all of these cases; the best version of EM is a new one with 0 < γ < 1.

38. Unified EM: Summary
- UEM generalizes existing variations of EM and constrained EM, providing new EM algorithms parameterized by a single parameter γ.
- An efficient dual projected subgradient ascent technique incorporates constraints into UEM.
- The best γ corresponds to neither EM (PR) nor hard EM (CoDL) and is found through the UEM framework; tuning γ adaptively changes the entropy of the posterior.
- UEM is easy to implement: add a few lines of code to existing EM code.
- Questions?

