1
Edit Distances William W. Cohen
2
Plan for this week
- Edit distances
  - Distance(s,t) = cost of the best edit sequence that transforms s → t
  - Found via ...
- Learning edit distances
  - Probabilistic generative model: pair HMMs
  - Learning now requires EM
  - Detour: EM for plain ol' HMMs
  - EM for pair HMMs
  - Discriminative learning for pair HMMs
- Why EM works
3
Motivation
- Common problem: classify a pair of strings (s,t) as "these denote the same entity [or similar entities]"
- Examples:
  - ("Carnegie-Mellon University", "Carnegie Mellon Univ.")
  - ("Noah Smith, CMU", "Noah A. Smith, Carnegie Mellon")
- Applications:
  - Co-reference in NLP
  - Linking entities in two databases
  - Removing duplicates in a database
  - Finding related genes
  - "Distant learning": training NER from dictionaries
4
String distance metrics: Levenshtein
Edit-distance metrics
- Distance is the cost of the shortest sequence of edit commands that transform s to t.
- Simplest set of operations:
  - Copy a character from s over to t (cost 0)
  - Delete a character in s (cost 1)
  - Insert a character in t (cost 1)
  - Substitute one character for another (cost 1)
- This is "Levenshtein distance"
5
Levenshtein distance - example
distance("William Cohen", "Willliam Cohon") = 2
[alignment diagram: "Willliam Cohon" aligned against "William Cohen", showing one insert and one substitution, each of cost 1]
6
Levenshtein distance - example
distance("William Cohen", "Willliam Cohon") = 2
[same alignment diagram, with the inserted character shown as a gap ("_") in s]
7
Computing Levenshtein distance - 1
D(i,j) = score of best alignment from s1..si to t1..tj
       = min { D(i-1,j-1),      if si = tj   // copy
               D(i-1,j-1) + 1,  if si != tj  // substitute
               D(i-1,j) + 1,                 // insert
               D(i,j-1) + 1 }                // delete
8
Computing Levenshtein distance - 2
D(i,j) = score of best alignment from s1..si to t1..tj
       = min { D(i-1,j-1) + d(si,tj),  // subst/copy
               D(i-1,j) + 1,           // insert
               D(i,j-1) + 1 }          // delete

(simplify by letting d(c,d) = 0 if c = d, and 1 otherwise)
Also let D(i,0) = i (for i inserts) and D(0,j) = j.
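A minimal Python sketch of this recurrence (not from the slides; the traceback needed to recover an alignment is omitted):

```python
def levenshtein(s, t):
    """D[i][j] = minimum edit cost to turn s[:i] into t[:j]."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):          # D(i,0) = i
        D[i][0] = i
    for j in range(n + 1):          # D(0,j) = j
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1      # d(si,tj)
            D[i][j] = min(D[i - 1][j - 1] + d,        # copy / substitute
                          D[i - 1][j] + 1,            # insert
                          D[i][j - 1] + 1)            # delete
    return D[m][n]

print(levenshtein("William Cohen", "Willliam Cohon"))   # -> 2
```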
9
Computing Levenshtein distance - 3
D(i,j) = min { D(i-1,j-1) + d(si,tj),  // subst/copy
               D(i-1,j) + 1,           // insert
               D(i,j-1) + 1 }          // delete

[worked example: DP table over the characters C O H E N, first row filled with 1 2 3 4 5; the bottom-right cell is D(s,t)]
10
Computing Levenshtein distance – 4
D(i,j) = min { D(i-1,j-1) + d(si,tj),  // subst/copy
               D(i-1,j) + 1,           // insert
               D(i,j-1) + 1 }          // delete

[same worked DP table, now with back-pointers]
A trace indicates where the min value came from, and can be used to recover the edit operations and/or a best alignment (there may be more than one).
11
Needleman-Wunch distance
d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutability, etc.)

D(i,j) = min { D(i-1,j-1) + d(si,tj),  // subst/copy
               D(i-1,j) + G,           // insert
               D(i,j-1) + G }          // delete

G = "gap cost"
Example: "William Cohen" vs. "Wukkuan Cigeb" (each typo is an adjacent key)
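A Python sketch of the same DP with a character-level cost d(c1,c2) and gap cost G. The typo_d function is an assumed toy stand-in for a cost table derived from typo frequencies (adjacent keys are cheaper to confuse); it is not from the slides.

```python
def needleman_wunsch(s, t, d, G):
    """Global alignment cost with per-character substitution cost d and gap cost G."""
    m, n = len(s), len(t)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * G
    for j in range(1, n + 1):
        D[0][j] = j * G
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + d(s[i - 1], t[j - 1]),  # subst/copy
                          D[i - 1][j] + G,                          # insert
                          D[i][j - 1] + G)                          # delete
    return D[m][n]

# Toy d: 0 for a match, 0.5 if the two characters sit next to each other in this
# QWERTY ordering (a crude proxy for likely typos), 1 otherwise.
KEYS = "qwertyuiopasdfghjklzxcvbnm"
def typo_d(a, b):
    if a == b:
        return 0.0
    if a in KEYS and b in KEYS and abs(KEYS.index(a) - KEYS.index(b)) == 1:
        return 0.5
    return 1.0

print(needleman_wunsch("william cohen", "wukkuan cigeb", typo_d, G=1.0))
```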
12
Smith-Waterman distance - 1
D(i,j) = max { 0,                        // start over
               D(i-1,j-1) - d(si,tj),    // subst/copy
               D(i-1,j) - G,             // insert
               D(i,j-1) - G }            // delete

Distance is the maximum over all i,j in the table of D(i,j).
13
Smith-Waterman distance - 2
D(i,j) = max { 0,                        // start over
               D(i-1,j-1) - d(si,tj),    // subst/copy
               D(i-1,j) - G,             // insert
               D(i,j-1) - G }            // delete

G = 1, d(c,c) = -2, d(c,d) = +1 for c != d
[worked example: score matrix over the characters C O H E N]
14
Smith-Waterman distance - 3
D(i,j) = max { 0,                        // start over
               D(i-1,j-1) - d(si,tj),    // subst/copy
               D(i-1,j) - G,             // insert
               D(i,j-1) - G }            // delete

G = 1, d(c,c) = -2, d(c,d) = +1 for c != d
[worked example continued: score matrix over the characters C O H E N, with entries up to +5]
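A Python sketch of Smith-Waterman under the slides' convention (subtract d and G, take a max, and "start over" at 0); the strings in the example call are taken from the motivation slide, not from the slides' worked matrix.

```python
def smith_waterman(s, t, d, G):
    """Local-alignment score: the best-scoring pair of substrings of s and t."""
    m, n = len(s), len(t)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(0.0,                                      # start over
                          D[i - 1][j - 1] - d(s[i - 1], t[j - 1]),  # subst/copy
                          D[i - 1][j] - G,                          # insert
                          D[i][j - 1] - G)                          # delete
            best = max(best, D[i][j])
    return best

# Slide parameters: G = 1, d(c,c) = -2 (so a match adds +2), d(c,d) = +1.
d = lambda a, b: -2.0 if a == b else 1.0
print(smith_waterman("Carnegie-Mellon University", "Carnegie Mellon Univ.", d, G=1.0))
```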
15
Smith-Waterman distance: Monge & Elkan’s WEBFIND (1996)
16
Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996)
- Used a standard version of Smith-Waterman with hand-tuned weights for inserts and character substitutions.
- Split large text fields by separators like commas, etc., and explored different pairings (since S-W assigns a large cost to large transpositions).
- Results were competitive with plausible competitors.
17
Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996)
String s = A1 A2 ... AK, string t = B1 B2 ... BL
sim' is edit distance scaled to [0,1]
Monge-Elkan's "recursive matching scheme" is the average maximal similarity of Ai to Bj:
    sim(s,t) = (1/K) * sum_{i=1..K} max_{j=1..L} sim'(Ai, Bj)
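A Python sketch of the recursive matching scheme, reusing the levenshtein function from the earlier sketch; scaling by the longer token's length (and lower-casing) are assumptions here, not necessarily Monge & Elkan's exact sim'.

```python
def sim_prime(a, b):
    """Token-level similarity: edit distance rescaled to [0,1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def monge_elkan(s, t):
    """Average, over tokens Ai of s, of the best sim' against any token Bj of t."""
    A, B = s.lower().split(), t.lower().split()
    return sum(max(sim_prime(a, b) for b in B) for a in A) / len(A)

print(monge_elkan("computer science department stanford university",
                  "Dept. of Comput. Sci. Stanford Univ."))
```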
18
Smith-Waterman distance: Monge & Elkan’s WEBFIND (1996)
[worked example: sim("computer science department stanford university palo alto california", "Dept. of Comput. Sci. Stanford Univ. CA USA") ≈ 0.51, built from per-token maximal similarities such as 0.92, 0.5, and 1.0]
22
Results: S-W from Monge & Elkan
23
More edit distance tricks: Affine gap distances
Smith-Waterman fails on some pairs that seem quite similar:
    William W. Cohen
    William W. 'Don't call me Dubya' Cohen
Intuitively, a single long insertion is "cheaper" than a lot of short insertions.
Intuitively, are springlest hulongru poinstertimon extisn't "cheaper" than a lot of short insertions.
(The previous sentence, with many short insertions scattered through it.)
24
Affine gap distances - 2 Idea:
Current cost of a "gap" of n characters: nG
Make this cost: A + (n-1)B, where A is the cost of "opening" a gap and B is the cost of "continuing" a gap.
25
Affine gap distances - 3

D(i,j)  = max { D(i-1,j-1) + d(si,tj),    // subst/copy
                IS(i-1,j-1) + d(si,tj),
                IT(i-1,j-1) + d(si,tj) }

IS(i,j) = max { D(i-1,j) - A,             // open a gap
                IS(i-1,j) - B }           // continue a gap
        = best score in which si is aligned with a "gap"

IT(i,j) = max { D(i,j-1) - A,
                IT(i,j-1) - B }
        = best score in which tj is aligned with a "gap"
26
Affine gap distances - 4
[state-transition diagram over D and the gap states IS, IT, with edges weighted by d(si,tj), the gap-open cost A, and the gap-continuation cost B]
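A Python sketch of the three-matrix recurrence (Gotoh-style affine-gap alignment) in the same max/score convention as the slide; the scoring values, gap costs, and handling of leading gaps below are assumptions, not values from the slides.

```python
NEG = float("-inf")

def affine_gap_score(s, t, score, A, B):
    """Global alignment score where a gap of n characters costs A + (n-1)*B.
    D[i][j]  : best score with s_i aligned to t_j
    IS[i][j] : best score with s_i aligned to a gap
    IT[i][j] : best score with t_j aligned to a gap"""
    m, n = len(s), len(t)
    D  = [[NEG] * (n + 1) for _ in range(m + 1)]
    IS = [[NEG] * (n + 1) for _ in range(m + 1)]
    IT = [[NEG] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):                  # leading gap against s
        IS[i][0] = -A - (i - 1) * B
    for j in range(1, n + 1):                  # leading gap against t
        IT[0][j] = -A - (j - 1) * B
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sc = score(s[i - 1], t[j - 1])
            D[i][j]  = sc + max(D[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1])
            IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)   # open vs. continue
            IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)
    return max(D[m][n], IS[m][n], IT[m][n])

# Assumed toy scoring: +2 match, -1 mismatch; gap open A = 2, gap continue B = 0.5.
score = lambda a, b: 2.0 if a == b else -1.0
print(affine_gap_score("William W. Cohen",
                       "William W. 'Don't call me Dubya' Cohen",
                       score, A=2.0, B=0.5))
```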
27
Affine gap distances – experiments (from McCallum,Nigam,Ungar KDD2000)
Goal is to match data like this: [figure: example citation records to be matched]
28
Affine gap distances – experiments (from McCallum,Nigam,Ungar KDD2000)
- Hand-tuned edit distance
- Lower costs for affine gaps
- Even lower cost for affine gaps near a "."
- HMM-based normalization to group title, author, booktitle, etc. into fields
29
Affine gap distances – experiments
[results table comparing TFIDF, static Edit Distance, and the Adaptive distance on the Cora, OrgName1, OrgName2, Restaurant, and Parks datasets]
30
Plan for this week
- Edit distances
  - Distance(s,t) = cost of the best edit sequence that transforms s → t
  - Found via dynamic programming
- Learning edit distances: Ristad and Yianilos
  - Probabilistic generative model: pair HMMs
  - Learning now requires EM
  - Detour: EM for plain ol' HMMs
  - EM for pair HMMs
  - Discriminative learning for pair HMMs
- Why EM works
31
HMM Notation
32
HMM Example
Sample output: xT = heehahaha, sT = 122121212
[two-state HMM: states 1 and 2 with transition probabilities Pr(1->2), Pr(2->1), and per-state emission distributions]
Pr(1->x): d 0.3, h 0.5, b 0.2
Pr(2->x): a 0.3, e 0.5, o 0.2
33
Review of HMMs/CRFs
- Borkar et al.: training data is (x,y) pairs, where y indicates the "hidden" state sequence
  - learning is counting + smoothing
- CRFs: training data is (x,y) pairs; learning is optimizing the gradient of the likelihood
  - match feature frequencies on the training (x,y) with expected frequencies on (x,y'), y' ~ Pr(y'|x,λ)
  - iteratively use forward-backward to compute the expected probability of each state transition (and hence each feature)
- New case: training data is strings x; the state sequence is unknown
  - iteratively use forward-backward to compute expected transitions and emissions, then learn by "soft" counting + smoothing
34
HMM Inference
Key point: Pr(si = l) depends only on Pr(l'->l) and si-1
[trellis diagram: states l = 1..K against positions t = 1..T, with observations x1, x2, ..., xT]
35
HMM Inference
Key point: Pr(si = l) depends only on Pr(l'->l) and si-1, so you can propagate probabilities forward...
[same trellis diagram, with forward propagation indicated]
36
HMM Inference – Forward Algorithm
[trellis diagram]
α(l,t) = Pr(x1..xt, st = l)
       = Pr(l->xt) * sum over l' of α(l',t-1) * Pr(l'->l),   with α(l,1) = Pr(start->l) * Pr(l->x1)
37
HMM Inference
Forward algorithm: computes probabilities α(l,t) based on the information in the first t letters of the string; it ignores "downstream" information.
[trellis diagram]
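A Python sketch of the forward recursion; the emission tables follow the two-state example from the earlier slide, but the start and transition probabilities below are assumed values, not numbers from the slides.

```python
def forward(x, states, start, trans, emit):
    """alpha[t][l] = probability of the first t+1 letters of x ending in state l
    (i.e. α(l,t) with 0-indexed t), summed over all state paths."""
    alpha = [{l: start[l] * emit[l].get(x[0], 0.0) for l in states}]
    for t in range(1, len(x)):
        alpha.append({l: emit[l].get(x[t], 0.0) *
                         sum(alpha[t - 1][k] * trans[k][l] for k in states)
                      for l in states})
    return alpha

# Toy two-state model; the transition/start probabilities are assumptions.
states = [1, 2]
start  = {1: 0.5, 2: 0.5}
trans  = {1: {1: 0.2, 2: 0.8}, 2: {1: 0.8, 2: 0.2}}
emit   = {1: {"d": 0.3, "h": 0.5, "b": 0.2},   # Pr(1->x)
          2: {"a": 0.3, "e": 0.5, "o": 0.2}}   # Pr(2->x)

alpha = forward("heehahaha", states, start, trans, emit)
print(sum(alpha[-1].values()))   # Pr(x): sum over the final states
```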
38
HMM Inference
[trellis diagram]
39
HMM Learning – Baum-Welch
Repeat:
- Find Pr(si = l) for i = 1,...,T using the current θ's
  - via the forward-backward algorithm
- Re-estimate the θ's:
  - θ(l'->l) = #(l'->l) / #(l'), but replace #(l'->l) and #(l') with weighted versions of the counts, based on Pr(si = l) from above
  - θ(l'->x) = #(l'->x) / #(l'), again with weighted versions of the counts
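A Python sketch of one Baum-Welch iteration (soft counting), reusing forward and the toy model from the previous sketch; the smoothing constant is an assumption, and the start distribution is kept fixed here.

```python
def backward(x, states, trans, emit):
    """beta[t][l] = Pr(x_{t+1}..x_T | state at position t is l)."""
    T = len(x)
    beta = [{l: 1.0 for l in states} for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for l in states:
            beta[t][l] = sum(trans[l][k] * emit[k].get(x[t + 1], 0.0) * beta[t + 1][k]
                             for k in states)
    return beta

def baum_welch_step(xs, states, start, trans, emit, smooth=1e-3):
    """One EM iteration: expected ("soft") counts of transitions and emissions
    from forward-backward, then re-normalize (with a little smoothing)."""
    tc = {l: {k: smooth for k in states} for l in states}   # expected #(l -> k)
    ec = {l: {} for l in states}                            # expected #(l -> char)
    for x in xs:
        a = forward(x, states, start, trans, emit)
        b = backward(x, states, trans, emit)
        Z = sum(a[-1][l] for l in states)                   # Pr(x)
        for t in range(len(x)):
            for l in states:
                g = a[t][l] * b[t][l] / Z                   # Pr(s_t = l | x)
                ec[l][x[t]] = ec[l].get(x[t], smooth) + g
                if t + 1 < len(x):
                    for k in states:                        # Pr(s_t = l, s_{t+1} = k | x)
                        tc[l][k] += (a[t][l] * trans[l][k] *
                                     emit[k].get(x[t + 1], 0.0) * b[t + 1][k]) / Z
    new_trans = {l: {k: tc[l][k] / sum(tc[l].values()) for k in states} for l in states}
    new_emit  = {l: {c: v / sum(ec[l].values()) for c, v in ec[l].items()} for l in states}
    return new_trans, new_emit

trans, emit = baum_welch_step(["heehahaha", "hohodada"], states, start, trans, emit)
```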
40
In more detail…forward backward
41
In more detail…EM for sequences
42
HMM Learning: special case of EM
Expectation maximization:
- Find expectations over the hidden variables: Pr(Z = z)
  - Here: the forward-backward algorithm; the hidden variables are the states s at times t = 1,...,T
- Maximize the probability of the parameters given the expectations
  - Here: counting, replacing #(l'->l)/#(l') and #(l'->x)/#(l') with weighted versions
- A very general technique
43
Why EM works…more later
44
Plan for this week
- Edit distances
  - Distance(s,t) = cost of the best edit sequence that transforms s → t
  - Found via ...
- Learning edit distances
  - Probabilistic generative model: pair HMMs
  - Learning now requires EM
  - Detour: EM for plain ol' HMMs
  - EM for pair HMMs
  - Discriminative learning for pair HMMs
- Why EM works
45
Pair HMM Notation
46
Pair HMM Example
[single-state pair HMM (state 1) with an emission distribution Pr(e) over edit operations e, including pairs such as <a,a> (0.10), <e,e>, <h,h>, <h,t>, and gap operations such as <-,h> (0.01)]
47
Pair HMM Example
[same emission table as on the previous slide, now also showing deletions such as <e,-> (0.05)]
Sample run: zT = <h,t>, <e,e>, <e,e>, <h,h>, <e,->, <e,e>
Strings x,y produced by zT: x = heehee, y = teehe
Notice that (x,y) is also produced by z4 + <e,e>, <e,-> and by many other edit strings.
48
Distances based on pair HMMs
49
Pair HMM Inference Dynamic programming is possible: fill out matrix left-to-right, top-down
50
Pair HMM Inference
[matrix diagram: t = 1..T by v = 1..K, with observations x1, x2, ..., xT]
51
Pair HMM Inference
[matrix diagram, continued]
52
Pair HMM Inference
One difference: after i emissions of the pair HMM, we do not know the column position.
[matrix diagram: after i emissions, the process may be at different column positions t]
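A Python sketch of the forward DP for a single-state pair HMM: F(i,j) sums the probability of all edit sequences that generate the first i characters of x and the first j of y, so Pr(x,y) weights every alignment rather than only the best one. The emission table below is an assumed toy (a real pair HMM would also include a stop probability and normalized emissions).

```python
from collections import defaultdict

def pair_hmm_forward(x, y, emit):
    """F[i][j] = total probability of all edit sequences generating x[:i], y[:j]
    in a single-state pair HMM with emissions over <a,b>, <a,-> and <-,b>."""
    m, n = len(x), len(y)
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:                               # emit <x_i, y_j>
                F[i][j] += F[i - 1][j - 1] * emit[(x[i - 1], y[j - 1])]
            if i > 0:                                         # emit <x_i, ->
                F[i][j] += F[i - 1][j] * emit[(x[i - 1], "-")]
            if j > 0:                                         # emit <-, y_j>
                F[i][j] += F[i][j - 1] * emit[("-", y[j - 1])]
    return F[m][n]    # summed over *all* alignments, not just the best one

# Assumed toy emission table; unlisted pairs get probability 0.
emit = defaultdict(float)
for c in "aehot":
    emit[(c, c)] = 0.10        # copies such as <e,e>, <h,h>
emit[("h", "t")] = 0.05        # a confusable substitution
emit[("e", "-")] = 0.03        # deletion from x
emit[("-", "h")] = 0.02        # insertion into y

print(pair_hmm_forward("heehee", "teehe", emit))
```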
53
Pair HMM Inference: Forward-Backward
[matrix diagram for the forward-backward computation]
54
Multiple states
[pair HMM with three states (1, 2, 3), each with its own emission distribution Pr(e) over edit operations such as <a,a> 0.10, <e,e>, <h,h>, <a,-> 0.05, <h,t>, <-,h> 0.01, ...]
55
An extension: multiple states
Conceptually, add a "state" dimension to the model.
[stacked matrix diagrams: one t-by-v matrix (t = 1..T, v = 1..K) per state l = 1, 2, ...]
EM methods generalize easily to this setting.
56
EM to learn edit distances
Is this really like edit distance? Not really:
- sim(x,x) ≠ 1; generally sim(x,x) gets smaller for longer x
- Edit distance is based on the single best edit sequence; Pr(x,y) is based on the weighted cost of all successful edit sequences
Will learning work?
- Unlike linear models, there is no guarantee of global convergence: you might not find a good model even if one exists
57
Back to the R&Y paper...
- They consider "coarse" and "detailed" models, as well as mixtures of both.
- The coarse model is like a back-off model: merge edit operations into equivalence classes (e.g. based on equivalence classes of characters).
- They test by learning a distance for K-NN, with an additional latent variable.
58
K-NN with latent prototypes
[diagram: a test example y (a string of phonemes) is compared, using the learned phonetic distance, against possible prototypes x1, x2, ..., xm (known word pronunciations), which correspond to dictionary words w1, w2, ..., wK]
59
K-NN with latent prototypes
The method needs (x,y) pairs to train a distance; to handle this, an additional level of E/M is used to pick the "latent prototype" x to pair with each y.
[same diagram as on the previous slide]
60
Hidden-prototype K-NN
61
Experiments
E1: on-line pronunciation dictionary
E2: subset of E1 with corpus words
E3: dictionary from the training corpus
E4: dictionary from the training + test corpus (!)
E5: E1 + E3
62
Experiments
63
Experiments
64
Plan for this week
- Edit distances
  - Distance(s,t) = cost of the best edit sequence that transforms s → t
  - Found via ...
- Learning edit distances: Ristad and Yianilos
  - Probabilistic generative model: pair HMMs
  - Learning now requires EM
  - Detour: EM for plain ol' HMMs
  - EM for pair HMMs
  - Discriminative learning for pair HMMs
- Why EM works
66
Key ideas
- A pair of strings (x,y) is associated with a label: {match, nonmatch}
- Classification is done by a pair HMM with two non-initial states, {match, nonmatch}, with no transitions between them
- The model scores alignments (emission sequences) as match/nonmatch
67
Key ideas
- Score the alignment sequence
- The edit sequence is featurized
- Marginalize over all alignments to score match vs. nonmatch
68
Key ideas
To learn, combine EM and CRF learning:
- compute expectations over the (hidden) alignments
- use L-BFGS to maximize (or at least improve) the parameters λ
- repeat...
Initialize the model with a "reasonable" set of parameters:
- hand-tuned parameters for matching strings
- copy the match parameters to the non-match state and shrink them toward zero
69
Results
70
Plan for this week
- Edit distances
  - Distance(s,t) = cost of the best edit sequence that transforms s → t
  - Found via ...
- Learning edit distances: Ristad and Yianilos
  - Probabilistic generative model: pair HMMs
  - Learning now requires EM
  - Detour: EM for plain ol' HMMs
  - EM for pair HMMs
  - Discriminative learning for pair HMMs
- Why EM works
72
Jensen’s inequality…
73
Jensen's inequality…
[figure illustrating Jensen's inequality for the (concave) log function at points x1, x2, x3]
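The slides in this section are mostly figures; the bound they illustrate is the standard EM argument. For any distribution q over the hidden variable z, concavity of log (Jensen's inequality) gives a lower bound on the log-likelihood:

```latex
\log \Pr(x \mid \theta)
  = \log \sum_{z} \Pr(x, z \mid \theta)
  = \log \sum_{z} q(z)\,\frac{\Pr(x, z \mid \theta)}{q(z)}
  \;\ge\; \sum_{z} q(z)\,\log \frac{\Pr(x, z \mid \theta)}{q(z)}
```

The E step sets q(z) = Pr(z | x, θ_old), which makes the bound tight at the current parameters; the M step maximizes (or, in generalized EM, merely improves) the right-hand side over θ. Each iteration therefore cannot decrease log Pr(x | θ), which is why EM converges, though only to a local optimum.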
77
Comments
- Nice because we often know how to:
  - do learning in the model (if the hidden variables are known)
  - do inference in the model (to get the hidden variables)
  and that's all we need to do.
- Convergence: local, not global
- Generalized EM: E, but don't fully M; just improve the parameters