1 Edit Distances William W. Cohen

2 Midterm progress reports
Talk for 5 min per team (you probably want to have one person speak). Talk about:
- The problem and dataset
- The baseline results
- What you plan to do next
Send Brendan 3-4 slides in PDF by Monday night.

3 Plan for this week
Edit distances:
- Distance(s,t) = cost of the best edit sequence that transforms s into t
- Found via….
Learning edit distances:
- Probabilistic generative model: pair HMMs
- Learning now requires EM
- Detour: EM for plain ol' HMMs
- EM for pair HMMs
- Why EM works
Discriminative learning for pair HMMs

4 Motivation
Common problem: classify a pair of strings (s,t) as "these denote the same entity [or similar entities]"
Examples:
- ("Carnegie-Mellon University", "Carnegie Mellon Univ.")
- ("Noah Smith, CMU", "Noah A. Smith, Carnegie Mellon")
Applications:
- Co-reference in NLP
- Linking entities in two databases
- Removing duplicates in a database
- Finding related genes
- "Distant learning": training NER from dictionaries

5 Levenshtein distance - example
distance("William Cohen", "Willliam Cohon") = 2
One best alignment (s on top, t below; "_" marks a gap):
s: W I L L _ I A M   C O H E N
t: W I L L L I A M   C O H O N
The gap costs 1, the E/O substitution costs 1, and each aligned pair of equal characters (a copy) costs 0.

6 Computing Levenshtein distance - 2
D(i,j) = score of the best alignment of s1..si with t1..tj:
D(i,j) = min of:
- D(i-1,j-1) + d(si,tj)  // subst/copy
- D(i-1,j) + 1           // insert
- D(i,j-1) + 1           // delete
Simplify by letting d(c,d) = 0 if c = d, 1 otherwise; also let D(i,0) = i (for i inserts) and D(0,j) = j.

7 Computing Levenshtein distance – 4
D(i,j) = min of:
- D(i-1,j-1) + d(si,tj)  // subst/copy
- D(i-1,j) + 1           // insert
- D(i,j-1) + 1           // delete
[figure: the DP matrix for an example pair, columns labeled C O H E N, filled in row by row]
A trace indicates where the min value came from, and can be used to find the edit operations and/or a best alignment (there may be more than one).
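
A minimal Python sketch of this recurrence with a traceback; function and variable names are mine, not from the slides:

def levenshtein(s, t):
    """Return (distance, alignment), where alignment is a list of
    (s_char_or_None, t_char_or_None) pairs; None marks a gap."""
    m, n = len(s), len(t)
    # D[i][j] = cost of the best alignment of s[:i] with t[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                                  # i inserts
    for j in range(1, n + 1):
        D[0][j] = j                                  # j inserts
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1   # d(si, tj)
            D[i][j] = min(D[i - 1][j - 1] + sub,     # subst/copy
                          D[i - 1][j] + 1,           # insert
                          D[i][j - 1] + 1)           # delete
    # Trace back from D[m][n] to recover one best alignment.
    align, i, j = [], m, n
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and s[i - 1] == t[j - 1] else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + sub:
            align.append((s[i - 1], t[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            align.append((s[i - 1], None)); i -= 1
        else:
            align.append((None, t[j - 1])); j -= 1
    return D[m][n], align[::-1]

print(levenshtein("William Cohen", "Willliam Cohon")[0])   # -> 2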

8 Extensions
Add parameters for differential costs for delete, substitute, … operations:
- E.g., a "gap cost" G and substitution costs d(x,y)
Allow s to match a substring of t (Smith-Waterman)
Model the cost of a length-n insertion as A + Bn instead of Gn ("affine distance"):
- Need to remember whether a gap is open in s, in t, or in neither
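
A hedged sketch of the Smith-Waterman variant mentioned above (local alignment with similarity scores, so higher is better); the match/mismatch/gap values are illustrative choices, not from the slides. The affine-cost extension (A + Bn) would add extra matrices to remember whether a gap is open (Gotoh's algorithm), omitted here:

def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Best local alignment score: s may match any substring of t."""
    m, n = len(s), len(t)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,                    # restart: free to begin anywhere
                          H[i - 1][j - 1] + d,  # subst/copy
                          H[i - 1][j] + gap,    # gap in t
                          H[i][j - 1] + gap)    # gap in s
            best = max(best, H[i][j])
    return best

print(smith_waterman("Cohen", "William Cohen jr."))  # -> 10 ("Cohen" matches exactly)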

9 Forward-backward for HMMs
α_t(i): all paths to s_t = i and all emissions up to and including t
β_t(i): all paths after s_t = i and all emissions after t
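
A compact NumPy sketch of these two quantities for a plain discrete HMM (names are mine, not code from the lecture):

import numpy as np

def forward_backward(pi, A, B, obs):
    """pi: K initial probs; A: KxK transition matrix; B: KxV emission
    matrix; obs: list of observation indices. Returns (alpha, beta)."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]                       # paths of length 1
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # sum over predecessors
    beta[T - 1] = 1.0                                  # empty suffix
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) # sum over successors
    return alpha, beta   # note: alpha[T-1].sum() == Pr(obs)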

10 EM for HMMs
The E-step needs two expected counts:
- the probability of passing through state i at time t and emitting a at t
- the probability of passing through states i,j at times t,t+1 … and continuing to the end
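
In the usual forward-backward notation (a standard reconstruction; the formulas themselves did not survive transcription), these two counts are

  γ_t(i) = Pr(s_t = i | o) = α_t(i) β_t(i) / Pr(o)

  ξ_t(i,j) = Pr(s_t = i, s_{t+1} = j | o) = α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j) / Pr(o)

where a_{ij} is a transition probability, b_j(·) an emission probability, and Pr(o) = Σ_i α_T(i). The M-step re-estimates a_{ij} as Σ_t ξ_t(i,j) / Σ_t γ_t(i), and emissions analogously from γ.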

11 Pair HMM Example
Emission distribution over pairs:
e        Pr(e)
<a,a>    0.10
<e,e>    …
<h,h>    …
<e,->    0.05
<h,t>    …
<-,h>    0.01
…        …
(probabilities sum to 1)
Sample run: z^T = <h,t>,<e,e>,<e,e>,<h,h>,<e,->,<e,e>
Strings x,y produced by z^T: x = heehee, y = teehe
Notice that (x,y) is also produced by the first four pairs followed by <e,e>,<e,->, and by many other edit strings.
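
A tiny sketch of how an edit string z yields the pair (x, y): each emitted pair <a,b> appends a to x and b to y, skipping the gap symbol "-" (function name is mine):

def apply_edit_string(z):
    x = ''.join(a for a, b in z if a != '-')
    y = ''.join(b for a, b in z if b != '-')
    return x, y

z = [('h','t'), ('e','e'), ('e','e'), ('h','h'), ('e','-'), ('e','e')]
print(apply_edit_string(z))   # -> ('heehee', 'teehe')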

12 Distances based on pair HMMs

13 Pair HMM Inference
[figure: a trellis with states v = 1..K on one axis and positions t = 1..T on the other; α(3,2) marks the forward value for state 3 at position 2, with emitted symbols (h, e, a, …) along the lattice]

14 Pair HMM Inference
[same trellis figure, animation step]

15 Pair HMM Inference: Forward-Backward
[figure: the same trellis of states v = 1..K over positions t = 1..T, annotated with forward and backward quantities]
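
The forward pass over this lattice, specialized to a single state so the recurrence stays visible (a sketch under that simplifying assumption; a full pair HMM also has K states plus transition and stop probabilities):

import numpy as np

def pair_forward(x, y, p_sub, p_del, p_ins):
    """alpha[i, j] = total probability, over ALL edit strings, of producing
    x[:i] and y[:j]; p_sub/p_del/p_ins are emission probability functions
    (my names) for <xi,yj>, <xi,->, and <-,yj>."""
    m, n = len(x), len(y)
    alpha = np.zeros((m + 1, n + 1))
    alpha[0, 0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                alpha[i, j] += alpha[i - 1, j - 1] * p_sub(x[i - 1], y[j - 1])
            if i > 0:
                alpha[i, j] += alpha[i - 1, j] * p_del(x[i - 1])    # <xi, ->
            if j > 0:
                alpha[i, j] += alpha[i, j - 1] * p_ins(y[j - 1])    # <-, yj>
    return alpha[m, n]

# Unlike the Levenshtein DP, which takes a min over edit sequences, this
# sums over all of them - exactly the contrast slide 16 draws next.
print(pair_forward("heehee", "teehe",
                   lambda a, b: 0.01, lambda a: 0.005, lambda b: 0.005))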

16 EM to learn edit distances
Is this really like edit distance? Not really:
- sim(x,x) ≠ 1; generally sim(x,x) gets smaller as x gets longer
- Edit distance is based on the single best edit sequence; Pr(x,y) is based on the weighted cost of all successful edit sequences
Will learning work?
- Unlike linear models, there is no guarantee of global convergence: you might not find a good model even if one exists

17 Back to the R&Y (Ristad and Yianilos) paper... They consider "coarse" and "detailed" models, as well as mixtures of both. The coarse model is like a back-off model: it merges edit operations into equivalence classes (e.g., based on equivalence classes of characters). They test by learning a distance for K-NN with an additional latent variable.

18 K-NN with latent prototypes
[figure: a test example y (a string of phonemes) is compared, via the learned phonetic distance, to possible prototypes x1, x2, x3, …, xm (known word pronunciations), which in turn map to words w1, w2, …, wK from the dictionary]

19 K-NN with latent prototypes
The method needs (x,y) pairs to train a distance. To handle this, an additional level of EM is used to pick the "latent prototype" x to pair with each y.
[figure: same diagram as the previous slide: y, the learned phonetic distance, prototypes x1..xm, and words w1..wK]
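
One way to write this down (my reconstruction of the setup, not a formula from the slides): treat the prototype as a hidden variable,

  Pr(y | w) = Σ_{x ∈ prototypes(w)} Pr(x | w) · Pr(y | x)

so the E-step computes Pr(x | y, w), a soft choice of prototype for each y, and the M-step re-trains the phonetic distance Pr(y | x) on the resulting expected (x,y) pairs.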

20 Plan for this week
Edit distances:
- Distance(s,t) = cost of the best edit sequence that transforms s into t
- Found via….
Learning edit distances: Ristad and Yianilos
- Probabilistic generative model: pair HMMs
- Learning now requires EM
- Detour: EM for plain ol' HMMs
- EM for pair HMMs
- Why EM works
Discriminative learning for pair HMMs

21 EM
X = data, θ = model, z = something you can't observe
Problem: the "complete data likelihood" involves the unobserved z
Algorithm: iteratively improve θ1 → θ2 → … → θn → …
Examples:
- Mixtures: z is the hidden mixture component
- HMMs: z is the hidden state sequence
- Pair HMMs: z is the hidden sequence of pairs (x1,y1),… given (x,y)
- Latent-variable topic models (e.g., LDA): z is the assignment of words to topics

22 Jensen’s inequality…

23 Jensen's inequality for a convex f
[figure: Jensen's inequality illustrated for a convex function]

24 Jensen's inequality for ln (a concave function)
[figure: Jensen's inequality illustrated for ln]
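
Stated explicitly (standard form, consistent with the two figures): for weights λ_i ≥ 0 with Σ_i λ_i = 1,

  f convex:              f(Σ_i λ_i x_i) ≤ Σ_i λ_i f(x_i)
  f concave (e.g., ln):  ln(Σ_i λ_i x_i) ≥ Σ_i λ_i ln x_i

The concave form is the one the EM derivation below uses, with the λ_i playing the role of the posterior Pr(z | X, θn).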

25 X = data, θ = model, z = something you can't observe
Let's think about moving from θn (our current parameter vector) to some new θ (the next one, hopefully better).
We want to optimize L(θ) - L(θn) …. using something like…
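
A sketch of the standard lower-bound derivation this slide sets up (a reconstruction, not text from the slides). Writing L(θ) = ln Pr(X | θ) = ln Σ_z Pr(X | z, θ) Pr(z | θ):

  L(θ) - L(θn)
    = ln Σ_z Pr(z | X, θn) · [ Pr(X | z, θ) Pr(z | θ) / Pr(z | X, θn) ] - ln Pr(X | θn)
    ≥ Σ_z Pr(z | X, θn) ln [ Pr(X | z, θ) Pr(z | θ) / ( Pr(z | X, θn) Pr(X | θn) ) ]    (Jensen, ln concave)

Call the right-hand side Δ(θ | θn); it equals 0 at θ = θn, so any θ that increases Δ also increases L. Dropping the terms that do not depend on θ,

  θn+1 = argmax_θ Σ_z Pr(z | X, θn) ln Pr(X, z | θ)

i.e., the E-step computes the posterior Pr(z | X, θn) and the M-step maximizes the expected complete-data log-likelihood.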

26

27

28

29

30 Comments
Nice because we often know how to:
- do learning in the model (if the hidden variables are known)
- do inference in the model (to get the hidden variables)
…and that's all we need to do.
Convergence: local, not global.
Generalized EM: do the E-step, but instead of fully maximizing in the M-step, just improve θ.

31

32 Key ideas
A pair of strings (x,y) is associated with a label: {match, nonmatch}
Classification is done by a pair HMM with two non-initial states, {match, non-match}, with no transitions between them
The model scores alignments (emission sequences) as match/nonmatch

33 Key ideas
Score the alignment sequence: the edit sequence is featurized.
Marginalize over all alignments to score match vs. nonmatch:
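
The equations on this slide did not survive transcription; a plausible CRF-style reconstruction (the feature vector f and weights λ follow the slide's description, but the exact form is my assumption):

  score(a, x, y) = exp( λ · f(a, x, y) )                  for an alignment/edit sequence a
  Pr(class, a | x, y) = score_class(a, x, y) / Z(x, y),   class ∈ {match, nonmatch}
  Pr(match | x, y) = Σ_a Pr(match, a | x, y)

where Z(x, y) normalizes over both classes and all alignments, and each Σ_a is computed with the same kind of forward pass as in the pair-HMM sketch above.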

34 Key ideas
To learn, combine EM and CRF learning:
- compute expectations over the (hidden) alignments
- use L-BFGS to maximize (or at least improve) the parameters λ
- repeat…
Initialize the model with a "reasonable" set of parameters:
- hand-tuned parameters for matching strings
- copy the match parameters to the non-match state and shrink them toward zero

35 Results
We will come back to this family of methods in a couple of weeks (discriminatively trained latent-variable models).

