1
Unified Expectation Maximization
Rajhans Samdani
Joint work with Ming-Wei Chang (Microsoft Research) and Dan Roth
University of Illinois at Urbana-Champaign
NAACL 2012, Montreal

2
Weakly Supervised Learning in NLP
- Labeled data is scarce and difficult to obtain.
- A lot of work exists on learning with a small amount of labeled data.
- The Expectation Maximization (EM) algorithm is the de facto standard.
- More recently: significant work on injecting weak supervision or domain knowledge into EM via constraints
  - Constraint-driven Learning (CoDL; Chang et al, 07)
  - Posterior Regularization (PR; Ganchev et al, 10)

3
Weakly Supervised Learning: EM and ...?
- Several variants of EM exist in the literature:
  - Hard EM
  - Variants of constrained EM: CoDL and PR
- Which version to use: EM (PR) vs. hard EM (CoDL)?
- Or is there something better out there?
- OUR CONTRIBUTION: a unified framework for EM algorithms, Unified EM (UEM)
  - Includes existing EM algorithms
  - Picks the most suitable EM algorithm in a simple, adaptive, and principled way
  - Adapts to data, initialization, and constraints

4
Outline
- Background: Expectation Maximization (EM)
- EM with constraints
- Unified Expectation Maximization (UEM)
- Optimization algorithm for the E-step
- Experiments

5
Predicting Structures in NLP
- Predict the output (dependent variable) $y$ from the space of allowed outputs $\mathcal{Y}$, given the input variable $x$, using a parameter (weight) vector $w$.
- E.g. predict POS tags given a sentence, predict word alignments given sentences in two different languages, predict the entity-relation structure from a document.
- Prediction is expressed as $y^* = \arg\max_{y \in \mathcal{Y}} P(y \mid x; w)$

6
Learning Using EM: a Quick Primer
- Given unlabeled data $x$, estimate $w$; hidden: $y$
- for $t = 1 \ldots T$ do
  - E-step: estimate a posterior distribution, $q$, over $y$:
    $q_t(y) = P(y \mid x; w_t)$, the conditional distribution of $y$ given $w$
    Equivalently, $q_t(y) = \arg\min_q KL(q(y), P(y \mid x; w_t))$ (Neal and Hinton, 99)
  - M-step: estimate the parameters $w$ w.r.t. $q$:
    $w_{t+1} = \arg\max_w E_q[\log P(x, y; w)]$
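The E-step and M-step above can be illustrated on a toy problem. The sketch below is not from the talk: it assumes a mixture of two biased coins with a fixed uniform prior over coins, where each observation `x` is the number of heads in `flips` tosses and the hidden `y` is the coin that was used. All function names and the data are illustrative.

```python
def e_step(heads, flips, thetas):
    """Posterior q(y) = P(y | x; w) over which coin produced one trial."""
    # The binomial coefficient and the uniform prior cancel in the ratio.
    likelihoods = [t ** heads * (1 - t) ** (flips - heads) for t in thetas]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]


def m_step(data, flips, posteriors, num_coins):
    """Maximize E_q[log P(x, y; w)]: posterior-weighted head frequencies."""
    thetas = []
    for k in range(num_coins):
        weight = sum(q[k] for q in posteriors)
        heads = sum(q[k] * h for q, h in zip(posteriors, data))
        thetas.append(heads / (flips * weight))
    return thetas


def em(data, flips, thetas, iterations=50):
    """Alternate E-steps and M-steps for a fixed number of iterations."""
    for _ in range(iterations):
        posteriors = [e_step(h, flips, thetas) for h in data]
        thetas = m_step(data, flips, posteriors, len(thetas))
    return thetas


# Hypothetical data: heads observed in five trials of 10 flips each.
data = [9, 8, 2, 1, 9]
theta_a, theta_b = em(data, flips=10, thetas=[0.6, 0.4])
print(theta_a, theta_b)
```

With well-separated trials like these, the soft posteriors become nearly one-hot, so the estimates end up close to the head frequencies of the two obvious groups.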

7
Other Version of EM: Hard EM
- Standard EM
  - E-step: $q_t = \arg\min_q KL(q(y), P(y \mid x; w_t))$
  - M-step: $\arg\max_w E_q[\log P(x, y; w)]$
- Hard EM
  - E-step: $q(y) = \delta_{y = y^*}$, where $y^* = \arg\max_y P(y \mid x; w)$
  - M-step: $\arg\max_w E_q[\log P(x, y; w)]$
- Not clear which version to use!
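The only difference between the two E-steps is how much probability mass the posterior keeps on non-maximal outputs. A minimal sketch, assuming the model posterior over a small enumerable output space is already available as a list `p` (the names and numbers are illustrative, not from the talk):

```python
import math


def soft_e_step(p):
    """Standard EM: q is the model posterior itself."""
    return list(p)


def hard_e_step(p):
    """Hard EM: q puts all mass on y* = argmax_y P(y | x; w)."""
    best = max(range(len(p)), key=lambda i: p[i])
    return [1.0 if i == best else 0.0 for i in range(len(p))]


def entropy(q):
    """Shannon entropy of q, with 0 log 0 taken as 0."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)


p = [0.5, 0.3, 0.2]  # hypothetical posterior over three outputs
print(soft_e_step(p), entropy(soft_e_step(p)))
print(hard_e_step(p), entropy(hard_e_step(p)))
```

The hard posterior always has zero entropy, while the soft posterior keeps the entropy of the model distribution.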

8
Constrained EM
- Domain knowledge-based constraints can help a lot by guiding unsupervised learning:
  - Constraint-driven Learning (Chang et al, 07)
  - Posterior Regularization (Ganchev et al, 10)
  - Generalized Expectation Criterion (Mann & McCallum, 08)
  - Learning from Measurements (Liang et al, 09)
- Constraints are imposed on $y$ (a structured object, $\{y_1, y_2, \ldots, y_n\}$) to specify/restrict the set of allowed structures $\mathcal{Y}$.

9
Entity-Relation Prediction: Type Constraints
- Predict entity types: Per, Loc, Org, etc.
- Predict relation types: lives-in, org-based-in, works-for, etc.
- Entity-relation type constraints
[Figure: the sentence "Dole's wife, Elizabeth, is a resident of N.C." annotated with entity variables E1, E2, E3 and relation variables R12, R23; labels shown: Per, Loc, lives-in]

10
Bilingual Word Alignment: Agreement Constraints
- Align words from sentences in EN with sentences in FR.
- Agreement constraints: the alignment from EN-FR should agree with the alignment from FR-EN (Ganchev et al, 10).
[Picture: courtesy Lacoste-Julien et al 10]

11
Structured Prediction Constraints Representation
- Assume a set of linear constraints: $\mathcal{Y} = \{ y : Uy \le b \}$
- A universal representation (Roth and Yih, 07)
- Can be relaxed into expectation constraints on posterior probabilities: $E_q[Uy] \le b$
- Focus on introducing constraints during the E-step
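The difference between the hard constraint on structures and its relaxation in expectation can be seen on a tiny example. The sketch below assumes binary structures of length 2 and a single made-up constraint, "at most one position is on"; the matrix `U`, the bound `b`, and the distribution `q` are illustrative only.

```python
from itertools import product


def feasible(U, b, y):
    """Check the hard constraint Uy <= b, row by row."""
    return all(sum(u * yi for u, yi in zip(row, y)) <= bound
               for row, bound in zip(U, b))


def expected_Uy(U, q):
    """E_q[Uy] for a distribution q given as {structure: probability}."""
    return [sum(prob * sum(u * yi for u, yi in zip(row, y))
                for y, prob in q.items())
            for row in U]


U = [[1, 1]]  # one constraint: y_1 + y_2 <= 1
b = [1]

structures = list(product((0, 1), repeat=2))
allowed = [y for y in structures if feasible(U, b, y)]

# A posterior that puts half its mass on the infeasible structure (1, 1).
q = {(0, 0): 0.5, (1, 1): 0.5}
relaxed_ok = all(e <= bound for e, bound in zip(expected_Uy(U, q), b))
print(allowed, expected_Uy(U, q), relaxed_ok)
```

Here `q` satisfies the expectation constraint even though one structure in its support violates the hard constraint, which is what makes the relaxed form weaker than restricting every structure.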

12
Two Versions of Constrained EM
- Posterior Regularization (Ganchev et al, 10)
  - E-step: $\arg\min_q KL(q(y), P(y \mid x; w_t))$ subject to $E_q[Uy] \le b$
  - M-step: $\arg\max_w E_q[\log P(x, y; w)]$
- Constraint-driven Learning (Chang et al, 07)
  - E-step: $y^* = \arg\max_y P(y \mid x; w)$ subject to $Uy \le b$
  - M-step: $\arg\max_w E_q[\log P(x, y; w)]$
- Not clear which version to use!
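The hard constrained E-step used by constraint-driven learning can be sketched by brute force when the output space is tiny. The example below is an illustration under strong assumptions, not the authors' inference procedure: it assumes three independent binary variables with made-up "on" probabilities `p_on`, enumerates all structures, and applies a hypothetical constraint allowing at most one variable to be on.

```python
from itertools import product


def feasible(U, b, y):
    """Check the linear constraints Uy <= b."""
    return all(sum(u * yi for u, yi in zip(row, y)) <= bound
               for row, bound in zip(U, b))


def joint_prob(y, p_on):
    """P(y | x; w) under the assumed independent-variables model."""
    prob = 1.0
    for yi, pi in zip(y, p_on):
        prob *= pi if yi == 1 else 1.0 - pi
    return prob


def hard_e_step(p_on, U=None, b=None):
    """Return y* = argmax_y P(y | x; w), optionally subject to Uy <= b."""
    candidates = product((0, 1), repeat=len(p_on))
    if U is not None:
        candidates = (y for y in candidates if feasible(U, b, y))
    return max(candidates, key=lambda y: joint_prob(y, p_on))


p_on = [0.9, 0.8, 0.3]
unconstrained = hard_e_step(p_on)
constrained = hard_e_step(p_on, U=[[1, 1, 1]], b=[1])
print(unconstrained, constrained)
```

Real structured models replace the enumeration with constrained inference such as an ILP; the enumeration here only serves to make the effect of the constraint visible.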

13
So how do we learn...?
- EM (PR) vs. hard EM (CoDL)
  - Unclear which version of EM to use (Spitkovsky et al, 10)
- This is the starting point of our research.
- We present a family of EM algorithms that includes these EM algorithms (and infinitely many new ones): Unified Expectation Maximization (UEM)
- UEM lets us pick the best EM algorithm in a principled way.

14
Outline
- Notation and Expectation Maximization (EM)
- Unified Expectation Maximization
  - Motivation
  - Formulation and mathematical intuition
- Optimization algorithm for the E-step
- Experiments

15
Motivation: Unified Expectation Maximization (UEM)
- EM (PR) and hard EM (CoDL) differ mostly in the entropy of the posterior distribution.
- UEM tunes the entropy of the posterior distribution $q$ and is parameterized by a single parameter $\gamma$.
[Figure: posterior distributions for EM and for hard EM]

16
Unified EM (UEM)
- EM (PR) minimizes the KL divergence $KL(q, P(y \mid x; w))$, where
  $KL(q, p) = \sum_y q(y) \log q(y) - q(y) \log p(y)$
- UEM changes the E-step of standard EM and minimizes a modified KL divergence $KL(q, P(y \mid x; w); \gamma)$, where
  $KL(q, p; \gamma) = \sum_y \gamma\, q(y) \log q(y) - q(y) \log p(y)$
- The factor $\gamma$ changes the entropy of the posterior.
- Different $\gamma$ values $\to$ different EM algorithms
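To make the modified divergence concrete, the sketch below evaluates it for a small enumerable output space and computes its unconstrained minimizer. It relies on the closed form q(y) proportional to p(y)^(1/gamma) for gamma > 0, which follows from setting the gradient of the objective to zero on the probability simplex; the function names and the example distribution are illustrative assumptions.

```python
import math


def kl_gamma(q, p, gamma):
    """Modified KL: sum_y gamma * q(y) log q(y) - q(y) log p(y)."""
    return sum(gamma * qi * math.log(qi) - qi * math.log(pi)
               for qi, pi in zip(q, p) if qi > 0)


def uem_posterior(p, gamma):
    """Unconstrained minimizer of kl_gamma over q, for gamma >= 0."""
    if gamma < 0:
        raise ValueError("this sketch only covers gamma >= 0")
    if gamma == 0:
        # Limit case: all mass on the most probable output (hard EM).
        best = max(range(len(p)), key=lambda i: p[i])
        return [1.0 if i == best else 0.0 for i in range(len(p))]
    scaled = [math.log(pi) / gamma for pi in p]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


p = [0.5, 0.3, 0.2]  # hypothetical model posterior, all entries positive
for gamma in (1.0, 0.5, 0.0):
    print(gamma, uem_posterior(p, gamma))
```

At gamma = 1 the minimizer is the model posterior itself, and as gamma shrinks toward 0 the minimizer sharpens until it matches the hard EM posterior.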

17
Effect of Changing $\gamma$
$KL(q, p; \gamma) = \sum_y \gamma\, q(y) \log q(y) - q(y) \log p(y)$
[Figure panels: original distribution $p$; $q$ with $\gamma = 1$; $q$ with $\gamma = 0$; $q$ with $\gamma = \infty$; $q$ with $\gamma = -\infty$]
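The shape of the minimizing posterior for different values of the parameter can be worked out directly. The derivation below is a sketch for the unconstrained case, assuming $p(y) > 0$ for every $y$; the multiplier $\mu$ for the normalization constraint is introduced only for this illustration.

```latex
\begin{align*}
\mathcal{L}(q, \mu) &= \sum_y \bigl[\gamma\, q(y) \log q(y) - q(y) \log p(y)\bigr]
                       + \mu \Bigl(\sum_y q(y) - 1\Bigr) \\
\frac{\partial \mathcal{L}}{\partial q(y)} &= \gamma \bigl(\log q(y) + 1\bigr) - \log p(y) + \mu = 0
  \;\Longrightarrow\; q(y) \propto p(y)^{1/\gamma} \qquad (\gamma > 0) \\
\gamma = 1 &: \quad q = p \\
\gamma \to 0^{+} &: \quad q \text{ concentrates on } \arg\max_y p(y) \\
\gamma \to \infty &: \quad q \to \text{uniform} \\
\gamma < 0 &: \quad \text{the objective is concave in } q, \text{ so a minimum lies at a vertex } q = \delta_{y'}, \\
           &\phantom{:} \quad \text{where it equals } -\log p(y'); \text{ the best vertex is } y' = \arg\max_y p(y)
\end{align*}
```

This is why, without constraints, every non-positive value of the parameter gives the hard EM posterior, while values above 1 flatten the posterior.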

18
Unifying Existing EM Algorithms
$KL(q, p; \gamma) = \sum_y \gamma\, q(y) \log q(y) - q(y) \log p(y)$
- Changing $\gamma$ values results in different existing EM algorithms.
[Figure: a $\gamma$ axis with ticks at $-\infty$, 0, 1, and $\infty$. No constraints: hard EM, EM. With constraints: CoDL, PR. Also marked: deterministic annealing (Smith and Eisner, 04; Hofmann, 99)]

19
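Range of $\gamma$
$KL(q, p; \gamma) = \sum_y \gamma\, q(y) \log q(y) - q(y) \log p(y)$
- We focus on tuning $\gamma$ in the range $[0, 1]$.
- Infinitely many new EM algorithms
[Figure: a $\gamma$ axis from 0 to 1. No constraints: hard EM at $\gamma = 0$, EM at $\gamma = 1$. With constraints: LP approximation to CoDL (new) at $\gamma = 0$, PR at $\gamma = 1$]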

20
Tuning $\gamma$ in practice
- $\gamma$ essentially tunes the entropy of the posterior to better adapt to data, initialization, constraints, etc.
- We tune $\gamma$ using a small amount of development data over the range 0, 0.1, 0.2, 0.3, ..., 1.
- UEM for arbitrary $\gamma$ in our range is very easy to implement: existing EM/PR/hard EM/CoDL code can be easily extended to implement UEM.
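The tuning procedure amounts to a one-dimensional grid search. A minimal sketch, assuming two hypothetical callables supplied by the user: `train_uem(gamma)`, which runs UEM at a given parameter value and returns a model, and `dev_score(model)`, which returns a higher-is-better score on the development data. Neither name comes from the talk.

```python
def tune_gamma(train_uem, dev_score, grid=None):
    """Pick the gamma with the best development score over a grid."""
    if grid is None:
        grid = [i / 10 for i in range(11)]  # 0, 0.1, ..., 1
    scored = [(dev_score(train_uem(g)), g) for g in grid]
    best_score, best_gamma = max(scored)  # ties go to the larger gamma
    return best_gamma, best_score


# Stand-ins so the sketch runs: the "model" is just gamma itself, and the
# made-up score peaks at gamma = 0.7.
best_gamma, best_score = tune_gamma(
    train_uem=lambda g: g,
    dev_score=lambda model: -(model - 0.7) ** 2,
)
print(best_gamma, best_score)
```

Because each grid point is an independent training run, the search is easy to parallelize.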

21
Outline
- Setting up the problem
- Unified Expectation Maximization
- Solving the constrained E-step
  - Lagrange dual-based algorithm
  - Unification of existing algorithms
- Experiments

22
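The Constrained E-step
$\min_q \; KL(q, P(y \mid x; w); \gamma)$ ($\gamma$-parameterized KL divergence)
subject to
  $E_q[Uy] \le b$ (domain knowledge-based linear constraints)
  $q(y) \ge 0 \;\; \forall y, \quad \sum_y q(y) = 1$ (standard probability simplex constraints)
For $\gamma \ge 0$, the problem is convex.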

23
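Solving the Constrained E-step for $q(y)$
1. Introduce dual variables $\lambda$, one for each constraint.
2. Perform sub-gradient ascent on the dual variables, with $\nabla \lambda \propto E_q[Uy] - b$.
3. Compute $q$ for the given $\lambda$:
   - For $\gamma > 0$, compute $q(y) \propto P(y \mid x; w)^{1/\gamma} \exp(-\lambda^T U y / \gamma)$
   - With $\gamma \to 0$, this becomes unconstrained MAP inference: $y^* = \arg\max_y \log P(y \mid x; w) - \lambda^T U y$
Iterate steps 2 and 3 until convergence.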
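The three steps above can be sketched for an output space small enough to enumerate. This is an illustration under stated assumptions, not the authors' implementation: the update of q for fixed dual variables uses the form obtained by setting the gradient of the Lagrangian to zero, q(y) proportional to exp((log p(y) - lambda . Uy) / gamma); the step-size schedule `eta / sqrt(t)`, the function names, and the toy numbers are all choices made for this example. `Uy[i]` holds the precomputed vector U y for the i-th output.

```python
import math


def q_given_lambda(log_p, Uy, lam, gamma):
    """Step 3: the minimizing q for fixed dual variables lam."""
    scores = [lp - sum(l * u for l, u in zip(lam, uy))
              for lp, uy in zip(log_p, Uy)]
    if gamma == 0:
        # MAP inference on the penalized scores (hard posterior).
        best = max(range(len(scores)), key=lambda i: scores[i])
        return [1.0 if i == best else 0.0 for i in range(len(scores))]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / gamma) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def constrained_e_step(p, Uy, b, gamma, steps=500, eta=1.0):
    """Dual projected sub-gradient ascent for min KL(q, p; gamma), E_q[Uy] <= b."""
    log_p = [math.log(pi) for pi in p]
    lam = [0.0] * len(b)  # step 1: one dual variable per constraint
    for t in range(1, steps + 1):
        q = q_given_lambda(log_p, Uy, lam, gamma)
        # Step 2: the sub-gradient is the constraint violation E_q[Uy] - b.
        violation = [sum(qi * uy[j] for qi, uy in zip(q, Uy)) - b[j]
                     for j in range(len(b))]
        # Projection onto lam >= 0 handles inequality constraints.
        lam = [max(0.0, l + eta / math.sqrt(t) * v)
               for l, v in zip(lam, violation)]
    return q_given_lambda(log_p, Uy, lam, gamma), lam


# Toy problem: three outputs, one constraint saying q(first output) <= 0.3.
p = [0.6, 0.3, 0.1]
Uy = [[1.0], [0.0], [0.0]]
b = [0.3]
q, lam = constrained_e_step(p, Uy, b, gamma=1.0)
print(q, lam)
```

At gamma = 1 this reproduces the posterior-regularization projection on the toy problem: the constrained output is pushed down to the bound and the remaining mass is shared in proportion to the model posterior. For gamma = 0 the final iterate of a sub-gradient method can oscillate, so a practical version would likely track the best feasible solution seen.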

24
Some Properties of our E-step Optimization
- We use a dual projected sub-gradient ascent algorithm (Bertsekas, 99).
  - Includes inequality constraints
- For special instances where two (or more) "easy" problems are connected via constraints, it reduces to dual decomposition:
  - For $\gamma > 0$: convex dual decomposition over individual models (e.g. HMMs) connected via dual variables
    - $\gamma = 1$: dual decomposition in posterior regularization (Ganchev et al, 08)
  - For $\gamma = 0$: Lagrange relaxation/dual decomposition for hard ILP inference (Koo et al, 10; Rush et al, 11)

25
Outline
- Setting up the problem
- Introduction to Unified Expectation Maximization
- Lagrange dual-based optimization algorithm for the E-step
- Experiments
  - POS tagging
  - Entity-Relation Extraction
  - Word Alignment

26
Experiments: exploring the role of $\gamma$
- Test whether tuning $\gamma$ improves performance over the baselines.
- Study the relation between the quality of initialization and $\gamma$ (or "hardness" of inference).
- Compare against:
  - Posterior Regularization (PR), which corresponds to $\gamma = 1.0$
  - Constraint-driven Learning (CoDL), which corresponds to $\gamma = -\infty$

27
Unsupervised POS Tagging
- Model as a first-order HMM.
- Try varying qualities of initialization:
  - Uniform initialization: initialize with equal probability for all states
  - Supervised initialization: initialize with parameters trained on varying amounts of labeled data
- Test the "conventional wisdom" that hard EM does well with a good initialization and EM does better with a weak initialization.

28
Unsupervised POS tagging: Different EM instantiations
[Figure: performance relative to EM as a function of $\gamma$, ranging from hard EM to EM. Curves: uniform initialization; initialization with 5 examples; initialization with 10 examples; initialization with 20 examples; initialization with 40-80 examples]

29
Experiments: Entity-Relation Extraction
- Extract entity types (e.g. Loc, Org, Per) and relation types (e.g. Lives-in, Org-based-in, Killed) between pairs of entities.
- Add constraints:
  - Type constraints between entities and relations
  - Expected count constraints to regularize the counts of the 'None' relation
- Semi-supervised learning with a small amount of labeled data
[Figure: the sentence "Dole's wife, Elizabeth, is a resident of N.C." annotated with entity variables E1, E2, E3 and relation variables R12, R23]

30
Result on Relations
[Figure: macro-F1 scores as a function of the % of labeled data]

31
Experiments: Word Alignment
- Word alignment from a language S to a language T
  - We try the En-Fr and En-Es pairs.
- We use an HMM-based model with agreement constraints for word alignment.
- PR with agreement constraints is known to give huge improvements over the HMM (Ganchev et al, 08; Graca et al, 08).
- We use our efficient algorithm to decompose the E-step into individual HMMs.

32
Word Alignment: EN-FR with 10k Unlabeled Data
[Figure: alignment error rate]

33
Word Alignment: EN-FR
[Figure: alignment error rate]

34
Word Alignment: FR-EN
[Figure: alignment error rate]

35
Word Alignment: EN-ES
[Figure: alignment error rate]

36
Word Alignment: ES-EN
[Figure: alignment error rate]

37
Experiments Summary
- In different settings, different baselines work better:
  - Entity-Relation extraction: CoDL does better than PR
  - Word Alignment: PR does better than CoDL
  - Unsupervised POS tagging: depends on the initialization
- UEM allows us to choose the best algorithm in all of these cases.
  - Best version of EM: a new version with $0 < \gamma < 1$

38
Unified EM: Summary
- UEM generalizes existing variations of EM/constrained EM.
- UEM provides new EM algorithms parameterized by a single parameter $\gamma$.
- An efficient dual projected sub-gradient ascent technique incorporates constraints into UEM.
- The best $\gamma$ corresponds to neither EM (PR) nor hard EM (CoDL) and is found through the UEM framework.
  - Tuning $\gamma$ adaptively changes the entropy of the posterior.
- UEM is easy to implement: add a few lines of code to existing EM code.
Questions?