
1 Edit Distances William W. Cohen

2 Plan for this week
Edit distances
  Distance(s,t) = cost of the best edit sequence that transforms s → t
  Found via ....
Learning edit distances
  Probabilistic generative model: pair HMMs
  Learning now requires EM
    Detour: EM for plain ol' HMMs
    EM for pair HMMs
  Discriminative learning for pair HMMs
  Why EM works

3 Motivation
Common problem: classify a pair of strings (s,t) as "these denote the same entity [or similar entities]".
Examples:
  ("Carnegie-Mellon University", "Carnegie Mellon Univ.")
  ("Noah Smith, CMU", "Noah A. Smith, Carnegie Mellon")
Applications:
  Co-reference in NLP
  Linking entities in two databases
  Removing duplicates in a database
  Finding related genes
  "Distant learning": training NER from dictionaries

4 String distance metrics: Levenshtein
Edit-distance metrics: the distance is the cost of the shortest sequence of edit commands that transforms s to t.
Simplest set of operations:
  Copy a character from s over to t (cost 0)
  Delete a character in s (cost 1)
  Insert a character in t (cost 1)
  Substitute one character for another (cost 1)
This is "Levenshtein distance".

5 Levenshtein distance - example
distance("William Cohen", "Willliam Cohon")
  s:    W I L L - I A M _ C O H E N
  t:    W I L L L I A M _ C O H O N
  op:   C C C C I C C C C C C C S C
  cost: 0 0 0 0 1 0 0 0 0 0 0 0 1 0
An alignment with total cost 2 (C = copy, I = insert, S = substitute; "_" stands for the space character).

6 Levenshtein distance - example
distance("William Cohen", "Willliam Cohon")
Same alignment as above; the extra "L" in t is aligned with a gap in s.

7 Computing Levenshtein distance - 1
D(i,j) = score of best alignment from s1..si to t1..tj
       = min( D(i-1,j-1)      if si = tj   // copy
              D(i-1,j-1) + 1  if si != tj  // substitute
              D(i-1,j) + 1                 // insert
              D(i,j-1) + 1 )               // delete

8 Computing Levenshtein distance - 2
D(i,j) = score of best alignment from s1..si to t1..tj
       = min( D(i-1,j-1) + d(si,tj)   // subst/copy
              D(i-1,j) + 1            // insert
              D(i,j-1) + 1 )          // delete
(simplify by letting d(c,d) = 0 if c = d, 1 otherwise)
Also let D(i,0) = i (for i inserts) and D(0,j) = j.
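A minimal Python sketch of this recurrence (the function and variable names are mine, not from the slides):

```python
def levenshtein(s, t):
    """Levenshtein distance via the D(i,j) recurrence above."""
    n, m = len(s), len(t)
    # D[i][j] = cost of the best alignment of s[:i] with t[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                                # align s[:i] with the empty string
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1   # d(si,tj): 0 for copy, 1 for substitute
            D[i][j] = min(D[i - 1][j - 1] + d,     # subst/copy
                          D[i - 1][j] + 1,         # insert
                          D[i][j - 1] + 1)         # delete
    return D[n][m]

print(levenshtein("William Cohen", "Willliam Cohon"))  # -> 2
```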

9 Computing Levenshtein distance - 3
D(i,j) = min( D(i-1,j-1) + d(si,tj)   // subst/copy
              D(i-1,j) + 1            // insert
              D(i,j-1) + 1 )          // delete
[DP table for an example pair with columns C, O, H, E, N; the first row, for "M", fills in as 1 2 3 4 5. The bottom-right cell equals D(s,t).]

10 Computing Levenshtein distance – 4
D(i,j) = min( D(i-1,j-1) + d(si,tj)   // subst/copy
              D(i-1,j) + 1            // insert
              D(i,j-1) + 1 )          // delete
[Same DP table, now with back-pointers.] A trace indicates where the min value came from, and can be used to recover the edit operations and/or a best alignment (there may be more than one).

11 Needleman-Wunsch distance
d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino-acid substitutability, etc.)
D(i,j) = min( D(i-1,j-1) + d(si,tj)   // subst/copy
              D(i-1,j) + G            // insert
              D(i,j-1) + G )          // delete
G = "gap cost"
Example pair: "William Cohen" vs. "Wukkuan Cigeb" (each substituted character is a neighboring key on the keyboard, the sort of pair a typo-aware d(c,d) should treat as close).

12 Smith-Waterman distance - 1
D(i,j) = max( 0                        // start over
              D(i-1,j-1) - d(si,tj)    // subst/copy
              D(i-1,j) - G             // insert
              D(i,j-1) - G )           // delete
The distance is the maximum over all i,j in the table of D(i,j).
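A minimal Python sketch of this variant, using the parameter values shown on the example slides below (G = 1, d(c,c) = -2, d(c,d) = +1); the function and variable names are mine:

```python
def smith_waterman(s, t, G=1, match=-2, mismatch=+1):
    """Smith-Waterman score with the slide's sign conventions:
    d(c,c) = match (negative), d(c,d) = mismatch (positive), G = gap cost.
    The recurrence subtracts d and G; the score is the best cell anywhere."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(0,                        # "start over"
                          D[i - 1][j - 1] - d,      # subst/copy
                          D[i - 1][j] - G,          # insert
                          D[i][j - 1] - G)          # delete
            best = max(best, D[i][j])
    return best
```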

13 Smith-Waterman distance - 2
D(i,j) = max( 0                        // start over
              D(i-1,j-1) - d(si,tj)    // subst/copy
              D(i-1,j) - G             // insert
              D(i,j-1) - G )           // delete
[Score matrix for an example pair with columns C, O, H, E, N, using G = 1, d(c,c) = -2, d(c,d) = +1.]

14 Smith-Waterman distance - 3
D(i,j) = max( 0                        // start over
              D(i-1,j-1) - d(si,tj)    // subst/copy
              D(i-1,j) - G             // insert
              D(i,j-1) - G )           // delete
[The same score matrix; the best (maximum) cell gives the Smith-Waterman score. G = 1, d(c,c) = -2, d(c,d) = +1.]

15 Smith-Waterman distance: Monge & Elkan’s WEBFIND (1996)

16 Smith-Waterman distance in Monge & Elkan's WEBFIND (1996)
Used a standard version of Smith-Waterman with hand-tuned weights for inserts and character substitutions.
Split large text fields by separators like commas, etc., and explored different pairings (since S-W assigns a large cost to large transpositions).
Results were competitive with plausible competitors.

17 Smith-Waterman distance in Monge & Elkan's WEBFIND (1996)
String s = A1 A2 ... AK, string t = B1 B2 ... BL; sim' is the edit distance scaled to [0,1].
Monge-Elkan's "recursive matching scheme" is the average maximal similarity of Ai to Bj:
  sim(s,t) = (1/K) * Σ_{i=1..K} max_{j=1..L} sim'(Ai, Bj)
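A small Python sketch of the recursive matching scheme; sim_prime stands in for the token-level scaled Smith-Waterman similarity, and the function and argument names are mine:

```python
def monge_elkan(s_tokens, t_tokens, sim_prime):
    """Monge-Elkan matching: average, over tokens Ai of s, of the
    maximal similarity of Ai to any token Bj of t.
    sim_prime is any token-level similarity scaled to [0,1]."""
    return sum(max(sim_prime(a, b) for b in t_tokens)
               for a in s_tokens) / len(s_tokens)
```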

18 Smith-Waterman distance: Monge & Elkan's WEBFIND (1996)
[Example: s = "computer science department stanford university palo alto california", t = "Dept. of Comput. Sci. Stanford Univ. CA USA". Tokens are linked to their best matches with similarities such as 0.92, 0.5, and 1.0, giving an overall score of 0.51.]

19

20

21

22 Results: S-W from Monge & Elkan

23 More edit distance tricks: Affine gap distances
Smith-Waterman fails on some pairs that seem quite similar:
  William W. Cohen
  William W. 'Don't call me Dubya' Cohen
Intuitively, a single long insertion is "cheaper" than a lot of short insertions.
Intuitively, are springlest hulongru poinstertimon extisn't "cheaper" than a lot of short insertions (the same sentence with many short insertions scattered through it).

24 Affine gap distances - 2
Idea: the current cost of a "gap" of n characters is nG.
Make this cost A + (n-1)B instead, where A is the cost of "opening" a gap and B is the cost of "continuing" a gap.

25 Affine gap distances - 3
D(i,j) = max( D(i-1,j-1) + d(si,tj)    // subst/copy
              IS(i-1,j-1) + d(si,tj)
              IT(i-1,j-1) + d(si,tj) )
IS(i,j) = max( D(i-1,j) - A
               IS(i-1,j) - B )   // best score in which si is aligned with a "gap"
IT(i,j) = max( D(i,j-1) - A
               IT(i,j-1) - B )   // best score in which tj is aligned with a "gap"
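A Python sketch of this three-matrix recurrence, treated as a global alignment score. The function name, the default gap parameters A and B, and the boundary handling are my choices for illustration, not values from the slides; d(c1,c2) is the character-level score used in the recurrence above (higher = better).

```python
def affine_gap_score(s, t, d, A=2.0, B=0.5):
    """Affine-gap alignment: a gap of n characters costs A + (n-1)*B.
    D: si aligned with tj; IS: si aligned with a gap; IT: tj aligned with a gap."""
    NEG = float("-inf")
    n, m = len(s), len(t)
    D  = [[NEG] * (m + 1) for _ in range(n + 1)]
    IS = [[NEG] * (m + 1) for _ in range(n + 1)]
    IT = [[NEG] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                D[i][j] = max(D[i-1][j-1], IS[i-1][j-1], IT[i-1][j-1]) + d(s[i-1], t[j-1])
            if i > 0:
                IS[i][j] = max(D[i-1][j] - A,    # open a gap in t
                               IS[i-1][j] - B)   # continue it
            if j > 0:
                IT[i][j] = max(D[i][j-1] - A,    # open a gap in s
                               IT[i][j-1] - B)   # continue it
    return max(D[n][m], IS[n][m], IT[n][m])
```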

26 Affine gap distances - 4
[Diagram: the same recurrence drawn as a state machine over D, IS, IT, with edges labeled -d(si,tj), -A (gap open), and -B (gap continue).]

27 Affine gap distances – experiments (from McCallum, Nigam, Ungar KDD2000)
Goal is to match data like this: [figure: example records to be matched]

28 Affine gap distances – experiments (from McCallum, Nigam, Ungar KDD2000)
Hand-tuned edit distance
Lower costs for affine gaps
Even lower cost for affine gaps near a "."
HMM-based normalization to group title, author, booktitle, etc. into fields

29 Affine gap distances – experiments
[Results table: matching performance for TFIDF, Edit Distance, and Adaptive (learned) distance on the Cora, OrgName1, OrgName2, Restaurant, and Parks datasets.]

30 Plan for this week
Edit distances
  Distance(s,t) = cost of the best edit sequence that transforms s → t
  Found via dynamic programming
Learning edit distances: Ristad and Yianilos
  Probabilistic generative model: pair HMMs
  Learning now requires EM
    Detour: EM for plain ol' HMMs
    EM for pair HMMs
  Discriminative learning for pair HMMs
  Why EM works

31 HMM Notation

32 HMM Example
A two-state HMM with transition probabilities Pr(1->2) and Pr(2->1), and emission probabilities
  Pr(1->x): d 0.3, h 0.5, b 0.2
  Pr(2->x): a 0.3, e 0.5, o 0.2
Sample output: xT = heehahaha, sT = 122121212.
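A tiny sampler for this generative model, as a sketch: the emission tables follow the slide as read off above, but the transition probabilities are not given in the transcript, so Pr(1->2) = Pr(2->1) = 0.5 below is an assumed value.

```python
import random

# Emission tables from the slide; transition probabilities are ASSUMED (not in the transcript).
TRANS = {1: {1: 0.5, 2: 0.5}, 2: {1: 0.5, 2: 0.5}}
EMIT = {1: {"d": 0.3, "h": 0.5, "b": 0.2},   # Pr(1->x)
        2: {"a": 0.3, "e": 0.5, "o": 0.2}}   # Pr(2->x)

def sample(T, start=1):
    """Sample a state sequence s_1..s_T and the output string x_1..x_T."""
    states, chars = [], []
    s = start
    for _ in range(T):
        states.append(s)
        chars.append(random.choices(list(EMIT[s]), weights=list(EMIT[s].values()))[0])
        s = random.choices(list(TRANS[s]), weights=list(TRANS[s].values()))[0]
    return "".join(chars), "".join(map(str, states))

print(sample(9))   # e.g. ('heehahaha', '122121212')
```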

33 Review of HMM/CRFs
Borkar et al.: training data is (x,y) pairs, where y indicates the "hidden" state sequence → learning is counting + smoothing.
CRFs: training data is (x,y) pairs; learning is optimizing the gradient of the likelihood → match feature frequencies on the training (x,y) with expected frequencies under (x,y'), y' ~ Pr(y'|x,λ) → iteratively use forward-backward to compute the expected probability of each state transition (and hence each feature).
New case: training data is strings x, and the state sequence is unknown → iteratively use forward-backward to compute expected transitions and emissions, then learn by "soft" counting + smoothing.

34 HMM Inference
Key point: Pr(si = l) depends only on Pr(l'->l) and si-1.
[Trellis diagram: states l = 1..K against times t = 1..T, with observations x1..xT.]

35 HMM Inference
Key point: Pr(si = l) depends only on Pr(l'->l) and si-1, so you can propagate probabilities forward...
[Same trellis, with probabilities propagated left to right.]

36 HMM Inference – Forward Algorithm
[Trellis diagram illustrating the forward recursion over states l = 1..K and observations x1..xT.]

37 HMM Inference
Forward algorithm: computes probabilities α(l,t) based on the information in the first t letters of the string, ignoring "downstream" information.
[Trellis diagram.]
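A sketch of the forward recursion in Python, with alpha[t][l] playing the role of α(l,t); the argument names and dictionary representation are mine:

```python
def forward(x, states, start, trans, emit):
    """Forward algorithm: alpha[t][l] = Pr(x_1..x_{t+1}, state at that step is l).
    start[l] = Pr(s1 = l); trans[lp][l] = Pr(lp -> l); emit[l][c] = Pr(l -> c)."""
    alpha = [{l: start[l] * emit[l].get(x[0], 0.0) for l in states}]
    for t in range(1, len(x)):
        prev = alpha[-1]
        alpha.append({l: emit[l].get(x[t], 0.0) *
                         sum(prev[lp] * trans[lp][l] for lp in states)
                      for l in states})
    return alpha   # Pr(x) = sum(alpha[-1].values())
```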

38 HMM Inference
[Trellis diagram over states l = 1..K and observations x1..xT.]

39 HMM Learning – Baum-Welch
Repeat:
  Find Pr(si = l) for i = 1,...,T using the current θ's (the forward-backward algorithm).
  Re-estimate the θ's:
    θ(l'->l) = #(l'->l) / #(l'), but replace #(l'->l) and #(l') with weighted versions of the counts, based on the Pr(si = l) from above.
    θ(l->x) = #(l->x) / #(l), again replacing the counts with weighted versions.
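A sketch of the re-estimation ("soft counting") step, assuming the expected counts have already been accumulated from the forward-backward posteriors; smoothing is omitted and the names are mine:

```python
def m_step(expected_trans, expected_emit):
    """Baum-Welch M-step: turn soft (expected) counts into probabilities.
    expected_trans[lp][l] = expected number of lp->l transitions;
    expected_emit[l][c]   = expected number of times state l emitted c;
    both come from forward-backward posteriors (the E-step).
    In practice these would be smoothed before normalizing."""
    theta_trans = {lp: {l: n / sum(row.values()) for l, n in row.items()}
                   for lp, row in expected_trans.items()}
    theta_emit = {l: {c: n / sum(row.values()) for c, n in row.items()}
                  for l, row in expected_emit.items()}
    return theta_trans, theta_emit
```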

40 In more detail… forward-backward

41 In more detail…EM for sequences

42 HMM Learning: special case of EM
Expectation maximization:
  Find expectations over the hidden variables, Pr(Z = z). Here: the forward-backward algorithm; the hidden variables are the states s at times t = 1,...,T.
  Maximize the probability of the parameters given those expectations. Here: counting, replacing #(l'->l)/#(l') and #(l->x)/#(l) with weighted versions.
A very general technique.

43 Why EM works…more later

44 Plan for this week
Edit distances
  Distance(s,t) = cost of the best edit sequence that transforms s → t
  Found via ....
Learning edit distances
  Probabilistic generative model: pair HMMs
  Learning now requires EM
    Detour: EM for plain ol' HMMs
    EM for pair HMMs
  Discriminative learning for pair HMMs
  Why EM works

45 Pair HMM Notation

46 Pair HMM Example
[Diagram: a single-state pair HMM (state 1) with an emission table Pr(e) over aligned character pairs e such as <a,a>, <e,e>, <h,h>, <h,t>, <-,h> (probabilities 0.10, 0.05, 0.01, ...).]

47 Pair HMM Example
[Same emission table Pr(e) over pairs e: <a,a> 0.10, <e,e>, <h,h>, <e,-> 0.05, <h,t>, <-,h> 0.01, ...]
Sample run: zT = <h,t>, <e,e>, <e,e>, <h,h>, <e,->, <e,e>.
Strings x, y produced by zT: x = heehee, y = teehe.
Notice that (x, y) is also produced by z1..z4 followed by <e,e>, <e,->, and by many other edit strings.

48 Distances based on pair HMMs

49 Pair HMM Inference Dynamic programming is possible: fill out matrix left-to-right, top-down
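A sketch of that dynamic program for a one-state pair HMM (stopping probabilities omitted; names mine). F(i,j) sums the probability of all edit sequences that generate the first i characters of x and the first j characters of y, so the result marginalizes over alignments rather than taking the single best one:

```python
def pair_hmm_forward(x, y, p_sub, p_del, p_ins):
    """Forward DP for a one-state pair HMM.
    p_sub[(a,b)] = Pr(<a,b>), p_del[a] = Pr(<a,->), p_ins[b] = Pr(<-,b>).
    F[i][j] = total probability of all edit sequences generating x[:i], y[:j]."""
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                F[i][j] += p_sub.get((x[i-1], y[j-1]), 0.0) * F[i-1][j-1]
            if i > 0:
                F[i][j] += p_del.get(x[i-1], 0.0) * F[i-1][j]
            if j > 0:
                F[i][j] += p_ins.get(y[j-1], 0.0) * F[i][j-1]
    return F[n][m]   # Pr(x, y), summed over all alignments
```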

50 Pair HMM Inference
[Trellis diagram: states v = 1..K against positions t = 1..T, with observations x1..xT.]

51 Pair HMM Inference
[Trellis diagram: states v = 1..K against positions t = 1..T.]

52 Pair HMM Inference
One difference: after i emissions of the pair HMM, we do not know the column position t.
[Trellis diagram: the same i can correspond to different (t, v) cells.]

53 Pair HMM Inference: Forward-Backward
[Trellis diagram: states v = 1..K against positions t = 1..T.]

54 Multiple states
[Diagram: a pair HMM with three states (1, 2, 3), each with its own emission table Pr(e) over pairs such as <a,a> 0.10, <e,e>, <h,h>, <a,-> 0.05, <h,t>, <-,h> 0.01, ...]

55 An extension: multiple states
Conceptually, add a "state" dimension to the model: one copy of the (v, t) trellis for each state l = 1, 2, ...
EM methods generalize easily to this setting.

56 EM to learn edit distances
Is this really like edit distance? Not really:
  sim(x,x) ≠ 1, and sim(x,x) generally gets smaller for longer x.
  Edit distance is based on the single best edit sequence; Pr(x,y) is based on the weighted cost of all successful edit sequences.
Will learning work? Unlike linear models there is no guarantee of global convergence: you might not find a good model even if one exists.

57 Back to R&Y paper...
They consider "coarse" and "detailed" models, as well as mixtures of both.
The coarse model is like a back-off model: merge edit operations into equivalence classes (e.g. based on equivalence classes of characters).
Test by learning a distance for K-NN with an additional latent variable.

58 K-NN with latent prototypes
[Diagram: a test example y (a string of phonemes) is compared, using the learned phonetic distance, to possible prototypes x1 ... xm (known word pronunciations), which correspond to dictionary words w1 ... wK.]

59 K-NN with latent prototypes
The method needs (x, y) pairs to train a distance. To handle this, an additional level of E/M is used to pick the "latent prototype" x to pair with each y.
[Same diagram as the previous slide.]

60 Hidden prototype K-nn

61 Experiments
E1: on-line pronunciation dictionary
E2: subset of E1 with corpus words
E3: dictionary from training corpus
E4: dictionary from training + test corpus (!)
E5: E1 + E3

62 Experiments

63 Experiments

64 Plan for this week
Edit distances
  Distance(s,t) = cost of the best edit sequence that transforms s → t
  Found via ....
Learning edit distances: Ristad and Yianilos
  Probabilistic generative model: pair HMMs
  Learning now requires EM
    Detour: EM for plain ol' HMMs
    EM for pair HMMs
  Discriminative learning for pair HMMs
  Why EM works

65

66 Key ideas
A pair of strings (x, y) is associated with a label: {match, nonmatch}.
Classification is done by a pair HMM with two non-initial states, {match, non-match}, with no transitions between them.
The model scores alignments (emission sequences) as match/nonmatch.

67 Key ideas
Score the alignment sequence: the edit sequence is featurized.
Marginalize over all alignments to score match vs. nonmatch.

68 Key ideas
To learn, combine EM and CRF learning:
  compute expectations over the (hidden) alignments
  use L-BFGS to maximize (or at least improve) the parameters λ
  repeat...
Initialize the model with a "reasonable" set of parameters:
  hand-tuned parameters for matching strings
  copy the match parameters to the non-match state and shrink them to zero.

69 Results

70 Plan for this week
Edit distances
  Distance(s,t) = cost of the best edit sequence that transforms s → t
  Found via ....
Learning edit distances: Ristad and Yianilos
  Probabilistic generative model: pair HMMs
  Learning now requires EM
    Detour: EM for plain ol' HMMs
    EM for pair HMMs
  Discriminative learning for pair HMMs
  Why EM works

71

72 Jensen’s inequality…

73 Jensen's inequality…

74

75

76

77 Comments
EM is nice because we often know how to:
  do learning in the model (if the hidden variables are known)
  do inference in the model (to get the hidden variables)
and that's all we need to do.
Convergence: local, not global.
Generalized EM: take the E-step, but in the M-step just improve the parameters rather than fully maximizing.

