1
Edit Distances William W. Cohen
2
Plan for this week
- Edit distances
  - Distance(s,t) = cost of the best edit sequence that transforms s → t
  - Found via ...
- Learning edit distances
  - Probabilistic generative model: pair HMMs
  - Learning now requires EM
  - Detour: EM for plain ol' HMMs
  - EM for pair HMMs
  - Discriminative learning for pair HMMs
- Why EM works
3
Motivation
- Common problem: classify a pair of strings (s,t) as "these denote the same entity [or similar entities]"
- Examples:
  - ("Carnegie-Mellon University", "Carnegie Mellon Univ.")
  - ("Noah Smith, CMU", "Noah A. Smith, Carnegie Mellon")
- Applications:
  - Co-reference in NLP
  - Linking entities in two databases
  - Removing duplicates in a database
  - Finding related genes
  - "Distant learning": training NER from dictionaries
4
String distance metrics: Levenshtein
Edit-distance metrics
- Distance is the cost of the shortest sequence of edit commands that transform s to t.
- Simplest set of operations:
  - Copy a character from s over to t (cost 0)
  - Delete a character in s (cost 1)
  - Insert a character in t (cost 1)
  - Substitute one character for another (cost 1)
- This is "Levenshtein distance"
5
Levenshtein distance - example
distance("William Cohen", "Willliam Cohon") = 2
[alignment diagram: "Willliam Cohon" aligned against "William Cohen", showing one insert and one substitution, each of cost 1]
6
Levenshtein distance - example
distance("William Cohen", "Willliam Cohon") = 2
[same alignment diagram, with the inserted character shown as a gap ("_") in s]
7
Computing Levenshtein distance - 1
D(i,j) = score of best alignment from s1..si to t1..tj
       = min { D(i-1,j-1),      if si = tj   // copy
               D(i-1,j-1) + 1,  if si != tj  // substitute
               D(i-1,j) + 1,                 // insert
               D(i,j-1) + 1 }                // delete
8
Computing Levenshtein distance - 2
D(i,j) = score of best alignment from s1..si to t1..tj
       = min { D(i-1,j-1) + d(si,tj),  // subst/copy
               D(i-1,j) + 1,           // insert
               D(i,j-1) + 1 }          // delete

(simplify by letting d(c,d) = 0 if c = d, and 1 otherwise)
Also let D(i,0) = i (for i inserts) and D(0,j) = j.
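A minimal Python sketch of this recurrence (not from the slides; the traceback needed to recover an alignment is omitted):

```python
def levenshtein(s, t):
    """D[i][j] = minimum edit cost to turn s[:i] into t[:j]."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):          # D(i,0) = i
        D[i][0] = i
    for j in range(n + 1):          # D(0,j) = j
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1      # d(si,tj)
            D[i][j] = min(D[i - 1][j - 1] + d,        # copy / substitute
                          D[i - 1][j] + 1,            # insert
                          D[i][j - 1] + 1)            # delete
    return D[m][n]

print(levenshtein("William Cohen", "Willliam Cohon"))   # -> 2
```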
9
Computing Levenshtein distance - 3
D(i,j) = min { D(i-1,j-1) + d(si,tj),  // subst/copy
               D(i-1,j) + 1,           // insert
               D(i,j-1) + 1 }          // delete

[worked example: DP table over the characters C O H E N, first row filled with 1 2 3 4 5; the bottom-right cell is D(s,t)]
10
Computing Levenshtein distance – 4
D(i,j) = min { D(i-1,j-1) + d(si,tj),  // subst/copy
               D(i-1,j) + 1,           // insert
               D(i,j-1) + 1 }          // delete

[same worked DP table, now with back-pointers]
A trace indicates where the min value came from, and can be used to recover the edit operations and/or a best alignment (there may be more than one).
11
Needleman-Wunch distance
d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutability, etc.)

D(i,j) = min { D(i-1,j-1) + d(si,tj),  // subst/copy
               D(i-1,j) + G,           // insert
               D(i,j-1) + G }          // delete

G = "gap cost"
Example: "William Cohen" vs. "Wukkuan Cigeb" (each typo is an adjacent key)
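A Python sketch of the same DP with a character-level cost d(c1,c2) and gap cost G. The typo_d function is an assumed toy stand-in for a cost table derived from typo frequencies (adjacent keys are cheaper to confuse); it is not from the slides.

```python
def needleman_wunsch(s, t, d, G):
    """Global alignment cost with per-character substitution cost d and gap cost G."""
    m, n = len(s), len(t)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * G
    for j in range(1, n + 1):
        D[0][j] = j * G
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j - 1] + d(s[i - 1], t[j - 1]),  # subst/copy
                          D[i - 1][j] + G,                          # insert
                          D[i][j - 1] + G)                          # delete
    return D[m][n]

# Toy d: 0 for a match, 0.5 if the two characters sit next to each other in this
# QWERTY ordering (a crude proxy for likely typos), 1 otherwise.
KEYS = "qwertyuiopasdfghjklzxcvbnm"
def typo_d(a, b):
    if a == b:
        return 0.0
    if a in KEYS and b in KEYS and abs(KEYS.index(a) - KEYS.index(b)) == 1:
        return 0.5
    return 1.0

print(needleman_wunsch("william cohen", "wukkuan cigeb", typo_d, G=1.0))
```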
12
Smith-Waterman distance - 1
D(i,j) = max { 0,                        // start over
               D(i-1,j-1) - d(si,tj),    // subst/copy
               D(i-1,j) - G,             // insert
               D(i,j-1) - G }            // delete

Distance is the maximum over all i,j in the table of D(i,j).
13
Smith-Waterman distance - 2
D(i,j) = max { 0,                        // start over
               D(i-1,j-1) - d(si,tj),    // subst/copy
               D(i-1,j) - G,             // insert
               D(i,j-1) - G }            // delete

G = 1, d(c,c) = -2, d(c,d) = +1 for c != d
[worked example: score matrix over the characters C O H E N]
14
Smith-Waterman distance - 3
D(i,j) = max { 0,                        // start over
               D(i-1,j-1) - d(si,tj),    // subst/copy
               D(i-1,j) - G,             // insert
               D(i,j-1) - G }            // delete

G = 1, d(c,c) = -2, d(c,d) = +1 for c != d
[worked example continued: score matrix over the characters C O H E N, with entries up to +5]
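A Python sketch of Smith-Waterman under the slides' convention (subtract d and G, take a max, and "start over" at 0); the strings in the example call are taken from the motivation slide, not from the slides' worked matrix.

```python
def smith_waterman(s, t, d, G):
    """Local-alignment score: the best-scoring pair of substrings of s and t."""
    m, n = len(s), len(t)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(0.0,                                      # start over
                          D[i - 1][j - 1] - d(s[i - 1], t[j - 1]),  # subst/copy
                          D[i - 1][j] - G,                          # insert
                          D[i][j - 1] - G)                          # delete
            best = max(best, D[i][j])
    return best

# Slide parameters: G = 1, d(c,c) = -2 (so a match adds +2), d(c,d) = +1.
d = lambda a, b: -2.0 if a == b else 1.0
print(smith_waterman("Carnegie-Mellon University", "Carnegie Mellon Univ.", d, G=1.0))
```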
15
Smith-Waterman distance: Monge & Elkan’s WEBFIND (1996)
16
Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996)
- Used a standard version of Smith-Waterman with hand-tuned weights for inserts and character substitutions.
- Split large text fields by separators like commas, etc., and explored different pairings (since S-W assigns a large cost to large transpositions).
- Results were competitive with plausible competitors.
17
Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996)
String s = A1 A2 ... AK, string t = B1 B2 ... BL
sim' is edit distance scaled to [0,1]
Monge-Elkan's "recursive matching scheme" is the average maximal similarity of Ai to Bj:
    sim(s,t) = (1/K) * sum_{i=1..K} max_{j=1..L} sim'(Ai, Bj)
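A Python sketch of the recursive matching scheme, reusing the levenshtein function from the earlier sketch; scaling by the longer token's length (and lower-casing) are assumptions here, not necessarily Monge & Elkan's exact sim'.

```python
def sim_prime(a, b):
    """Token-level similarity: edit distance rescaled to [0,1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def monge_elkan(s, t):
    """Average, over tokens Ai of s, of the best sim' against any token Bj of t."""
    A, B = s.lower().split(), t.lower().split()
    return sum(max(sim_prime(a, b) for b in B) for a in A) / len(A)

print(monge_elkan("computer science department stanford university",
                  "Dept. of Comput. Sci. Stanford Univ."))
```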
18
Smith-Waterman distance: Monge & Elkan’s WEBFIND (1996)
[worked example: sim("computer science department stanford university palo alto california", "Dept. of Comput. Sci. Stanford Univ. CA USA") ≈ 0.51, built from per-token maximal similarities such as 0.92, 0.5, and 1.0]
22
Results: S-W from Monge & Elkan
23
More edit distance tricks: Affine gap distances
Smith-Waterman fails on some pairs that seem quite similar:
    William W. Cohen
    William W. 'Don't call me Dubya' Cohen
Intuitively, a single long insertion is "cheaper" than a lot of short insertions.
Intuitively, are springlest hulongru poinstertimon extisn't "cheaper" than a lot of short insertions.
(The previous sentence, with many short insertions scattered through it.)
24
Affine gap distances - 2 Idea:
Current cost of a "gap" of n characters: nG
Make this cost: A + (n-1)B, where A is the cost of "opening" a gap and B is the cost of "continuing" a gap.
25
Affine gap distances - 3

D(i,j)  = max { D(i-1,j-1) + d(si,tj),    // subst/copy
                IS(i-1,j-1) + d(si,tj),
                IT(i-1,j-1) + d(si,tj) }

IS(i,j) = max { D(i-1,j) - A,             // open a gap
                IS(i-1,j) - B }           // continue a gap
        = best score in which si is aligned with a "gap"

IT(i,j) = max { D(i,j-1) - A,
                IT(i,j-1) - B }
        = best score in which tj is aligned with a "gap"
26
Affine gap distances - 4
[state-transition diagram over D and the gap states IS, IT, with edges weighted by d(si,tj), the gap-open cost A, and the gap-continuation cost B]
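A Python sketch of the three-matrix recurrence (Gotoh-style affine-gap alignment) in the same max/score convention as the slide; the scoring values, gap costs, and handling of leading gaps below are assumptions, not values from the slides.

```python
NEG = float("-inf")

def affine_gap_score(s, t, score, A, B):
    """Global alignment score where a gap of n characters costs A + (n-1)*B.
    D[i][j]  : best score with s_i aligned to t_j
    IS[i][j] : best score with s_i aligned to a gap
    IT[i][j] : best score with t_j aligned to a gap"""
    m, n = len(s), len(t)
    D  = [[NEG] * (n + 1) for _ in range(m + 1)]
    IS = [[NEG] * (n + 1) for _ in range(m + 1)]
    IT = [[NEG] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):                  # leading gap against s
        IS[i][0] = -A - (i - 1) * B
    for j in range(1, n + 1):                  # leading gap against t
        IT[0][j] = -A - (j - 1) * B
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sc = score(s[i - 1], t[j - 1])
            D[i][j]  = sc + max(D[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1])
            IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)   # open vs. continue
            IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)
    return max(D[m][n], IS[m][n], IT[m][n])

# Assumed toy scoring: +2 match, -1 mismatch; gap open A = 2, gap continue B = 0.5.
score = lambda a, b: 2.0 if a == b else -1.0
print(affine_gap_score("William W. Cohen",
                       "William W. 'Don't call me Dubya' Cohen",
                       score, A=2.0, B=0.5))
```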
27
Affine gap distances – experiments (from McCallum,Nigam,Ungar KDD2000)
Goal is to match data like this: [figure: example citation records to be matched]
28
Affine gap distances – experiments (from McCallum,Nigam,Ungar KDD2000)
- Hand-tuned edit distance
- Lower costs for affine gaps
- Even lower cost for affine gaps near a "."
- HMM-based normalization to group title, author, booktitle, etc. into fields
29
Affine gap distances – experiments
[results table comparing TFIDF, static Edit Distance, and the Adaptive distance on the Cora, OrgName1, OrgName2, Restaurant, and Parks datasets]
30
Plan for this week
- Edit distances
  - Distance(s,t) = cost of the best edit sequence that transforms s → t
  - Found via dynamic programming
- Learning edit distances: Ristad and Yianilos
  - Probabilistic generative model: pair HMMs
  - Learning now requires EM
  - Detour: EM for plain ol' HMMs
  - EM for pair HMMs
  - Discriminative learning for pair HMMs
- Why EM works
31
HMM Notation
32
HMM Example
Sample output: xT = heehahaha, sT = 122121212
[two-state HMM: states 1 and 2 with transition probabilities Pr(1->2), Pr(2->1), and per-state emission distributions]
Pr(1->x): d 0.3, h 0.5, b 0.2
Pr(2->x): a 0.3, e 0.5, o 0.2
33
Review of HMMs/CRFs
- Borkar et al.: training data is (x,y) pairs, where y indicates the "hidden" state sequence
  - learning is counting + smoothing
- CRFs: training data is (x,y) pairs; learning is optimizing the gradient of the likelihood
  - match feature frequencies on the training (x,y) with expected frequencies on (x,y'), y' ~ Pr(y'|x,λ)
  - iteratively use forward-backward to compute the expected probability of each state transition (and hence each feature)
- New case: training data is strings x; the state sequence is unknown
  - iteratively use forward-backward to compute expected transitions and emissions, then learn by "soft" counting + smoothing
34
HMM Inference
Key point: Pr(si = l) depends only on Pr(l'->l) and si-1
[trellis diagram: states l = 1..K against positions t = 1..T, with observations x1, x2, ..., xT]
35
HMM Inference
Key point: Pr(si = l) depends only on Pr(l'->l) and si-1, so you can propagate probabilities forward...
[same trellis diagram, with forward propagation indicated]
36
HMM Inference – Forward Algorithm
[trellis diagram]
α(l,t) = Pr(x1..xt, st = l)
       = Pr(l->xt) * sum over l' of α(l',t-1) * Pr(l'->l),   with α(l,1) = Pr(start->l) * Pr(l->x1)
37
HMM Inference
Forward algorithm: computes probabilities α(l,t) based on the information in the first t letters of the string; it ignores "downstream" information.
[trellis diagram]
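A Python sketch of the forward recursion; the emission tables follow the two-state example from the earlier slide, but the start and transition probabilities below are assumed values, not numbers from the slides.

```python
def forward(x, states, start, trans, emit):
    """alpha[t][l] = probability of the first t+1 letters of x ending in state l
    (i.e. α(l,t) with 0-indexed t), summed over all state paths."""
    alpha = [{l: start[l] * emit[l].get(x[0], 0.0) for l in states}]
    for t in range(1, len(x)):
        alpha.append({l: emit[l].get(x[t], 0.0) *
                         sum(alpha[t - 1][k] * trans[k][l] for k in states)
                      for l in states})
    return alpha

# Toy two-state model; the transition/start probabilities are assumptions.
states = [1, 2]
start  = {1: 0.5, 2: 0.5}
trans  = {1: {1: 0.2, 2: 0.8}, 2: {1: 0.8, 2: 0.2}}
emit   = {1: {"d": 0.3, "h": 0.5, "b": 0.2},   # Pr(1->x)
          2: {"a": 0.3, "e": 0.5, "o": 0.2}}   # Pr(2->x)

alpha = forward("heehahaha", states, start, trans, emit)
print(sum(alpha[-1].values()))   # Pr(x): sum over the final states
```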
38
HMM Inference
[trellis diagram]
39
HMM Learning – Baum-Welch
Repeat:
- Find Pr(si = l) for i = 1,...,T using the current θ's
  - via the forward-backward algorithm
- Re-estimate the θ's:
  - θ(l'->l) = #(l'->l) / #(l'), but replace #(l'->l) and #(l') with weighted versions of the counts, based on Pr(si = l) from above
  - θ(l'->x) = #(l'->x) / #(l'), again with weighted versions of the counts
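A Python sketch of one Baum-Welch iteration (soft counting), reusing forward and the toy model from the previous sketch; the smoothing constant is an assumption, and the start distribution is kept fixed here.

```python
def backward(x, states, trans, emit):
    """beta[t][l] = Pr(x_{t+1}..x_T | state at position t is l)."""
    T = len(x)
    beta = [{l: 1.0 for l in states} for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for l in states:
            beta[t][l] = sum(trans[l][k] * emit[k].get(x[t + 1], 0.0) * beta[t + 1][k]
                             for k in states)
    return beta

def baum_welch_step(xs, states, start, trans, emit, smooth=1e-3):
    """One EM iteration: expected ("soft") counts of transitions and emissions
    from forward-backward, then re-normalize (with a little smoothing)."""
    tc = {l: {k: smooth for k in states} for l in states}   # expected #(l -> k)
    ec = {l: {} for l in states}                            # expected #(l -> char)
    for x in xs:
        a = forward(x, states, start, trans, emit)
        b = backward(x, states, trans, emit)
        Z = sum(a[-1][l] for l in states)                   # Pr(x)
        for t in range(len(x)):
            for l in states:
                g = a[t][l] * b[t][l] / Z                   # Pr(s_t = l | x)
                ec[l][x[t]] = ec[l].get(x[t], smooth) + g
                if t + 1 < len(x):
                    for k in states:                        # Pr(s_t = l, s_{t+1} = k | x)
                        tc[l][k] += (a[t][l] * trans[l][k] *
                                     emit[k].get(x[t + 1], 0.0) * b[t + 1][k]) / Z
    new_trans = {l: {k: tc[l][k] / sum(tc[l].values()) for k in states} for l in states}
    new_emit  = {l: {c: v / sum(ec[l].values()) for c, v in ec[l].items()} for l in states}
    return new_trans, new_emit

trans, emit = baum_welch_step(["heehahaha", "hohodada"], states, start, trans, emit)
```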
40
In more detail…forward backward
41
In more detail…EM for sequences
42
HMM Learning: special case of EM
Expectation maximization:
- Find expectations over the hidden variables: Pr(Z = z)
  - Here: the forward-backward algorithm; the hidden variables are the states s at times t = 1,...,T
- Maximize the probability of the parameters given the expectations
  - Here: counting, replacing #(l'->l)/#(l') and #(l'->x)/#(l') with weighted versions
- A very general technique
43
Why EM works…more later
44
Plan for this week
- Edit distances
  - Distance(s,t) = cost of the best edit sequence that transforms s → t
  - Found via ...
- Learning edit distances
  - Probabilistic generative model: pair HMMs
  - Learning now requires EM
  - Detour: EM for plain ol' HMMs
  - EM for pair HMMs
  - Discriminative learning for pair HMMs
- Why EM works
45
Pair HMM Notation
46
Pair HMM Example
[single-state pair HMM (state 1) with an emission distribution Pr(e) over edit operations e, including pairs such as <a,a> (0.10), <e,e>, <h,h>, <h,t>, and gap operations such as <-,h> (0.01)]
47
Pair HMM Example
[same emission table as on the previous slide, now also showing deletions such as <e,-> (0.05)]
Sample run: zT = <h,t>, <e,e>, <e,e>, <h,h>, <e,->, <e,e>
Strings x,y produced by zT: x = heehee, y = teehe
Notice that (x,y) is also produced by z4 + <e,e>, <e,-> and by many other edit strings.
48
Distances based on pair HMMs
49
Pair HMM Inference Dynamic programming is possible: fill out matrix left-to-right, top-down
50
Pair HMM Inference
[matrix diagram: t = 1..T by v = 1..K, with observations x1, x2, ..., xT]
51
Pair HMM Inference
[matrix diagram, continued]
52
Pair HMM Inference
One difference: after i emissions of the pair HMM, we do not know the column position.
[matrix diagram: after i emissions, the process may be at different column positions t]
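A Python sketch of the forward DP for a single-state pair HMM: F(i,j) sums the probability of all edit sequences that generate the first i characters of x and the first j of y, so Pr(x,y) weights every alignment rather than only the best one. The emission table below is an assumed toy (a real pair HMM would also include a stop probability and normalized emissions).

```python
from collections import defaultdict

def pair_hmm_forward(x, y, emit):
    """F[i][j] = total probability of all edit sequences generating x[:i], y[:j]
    in a single-state pair HMM with emissions over <a,b>, <a,-> and <-,b>."""
    m, n = len(x), len(y)
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    F[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:                               # emit <x_i, y_j>
                F[i][j] += F[i - 1][j - 1] * emit[(x[i - 1], y[j - 1])]
            if i > 0:                                         # emit <x_i, ->
                F[i][j] += F[i - 1][j] * emit[(x[i - 1], "-")]
            if j > 0:                                         # emit <-, y_j>
                F[i][j] += F[i][j - 1] * emit[("-", y[j - 1])]
    return F[m][n]    # summed over *all* alignments, not just the best one

# Assumed toy emission table; unlisted pairs get probability 0.
emit = defaultdict(float)
for c in "aehot":
    emit[(c, c)] = 0.10        # copies such as <e,e>, <h,h>
emit[("h", "t")] = 0.05        # a confusable substitution
emit[("e", "-")] = 0.03        # deletion from x
emit[("-", "h")] = 0.02        # insertion into y

print(pair_hmm_forward("heehee", "teehe", emit))
```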
53
Pair HMM Inference: Forward-Backward
[matrix diagram for the forward-backward computation]
54
Multiple states
[pair HMM with three states (1, 2, 3), each with its own emission distribution Pr(e) over edit operations such as <a,a> 0.10, <e,e>, <h,h>, <a,-> 0.05, <h,t>, <-,h> 0.01, ...]
55
An extension: multiple states
Conceptually, add a "state" dimension to the model.
[stacked matrix diagrams: one t-by-v matrix (t = 1..T, v = 1..K) per state l = 1, 2, ...]
EM methods generalize easily to this setting.
56
EM to learn edit distances
Is this really like edit distance? Not really:
- sim(x,x) ≠ 1; generally sim(x,x) gets smaller for longer x
- Edit distance is based on the single best edit sequence; Pr(x,y) is based on the weighted cost of all successful edit sequences
Will learning work?
- Unlike linear models, there is no guarantee of global convergence: you might not find a good model even if one exists
57
Back to the R&Y paper...
- They consider "coarse" and "detailed" models, as well as mixtures of both.
- The coarse model is like a back-off model: merge edit operations into equivalence classes (e.g. based on equivalence classes of characters).
- They test by learning a distance for K-NN, with an additional latent variable.
58
K-NN with latent prototypes
[diagram: a test example y (a string of phonemes) is compared, using the learned phonetic distance, against possible prototypes x1, x2, ..., xm (known word pronunciations), which correspond to dictionary words w1, w2, ..., wK]
59
K-NN with latent prototypes
The method needs (x,y) pairs to train a distance; to handle this, an additional level of E/M is used to pick the "latent prototype" x to pair with each y.
[same diagram as on the previous slide]
60
Hidden-prototype K-NN
61
Experiments
E1: on-line pronunciation dictionary
E2: subset of E1 with corpus words
E3: dictionary from the training corpus
E4: dictionary from the training + test corpus (!)
E5: E1 + E3
62
Experiments
63
Experiments
64
Plan for this week
- Edit distances
  - Distance(s,t) = cost of the best edit sequence that transforms s → t
  - Found via ...
- Learning edit distances: Ristad and Yianilos
  - Probabilistic generative model: pair HMMs
  - Learning now requires EM
  - Detour: EM for plain ol' HMMs
  - EM for pair HMMs
  - Discriminative learning for pair HMMs
- Why EM works
66
Key ideas
- A pair of strings (x,y) is associated with a label: {match, nonmatch}
- Classification is done by a pair HMM with two non-initial states, {match, nonmatch}, with no transitions between them
- The model scores alignments (emission sequences) as match/nonmatch
67
Key ideas
- Score the alignment sequence
- The edit sequence is featurized
- Marginalize over all alignments to score match vs. nonmatch
68
Key ideas
To learn, combine EM and CRF learning:
- compute expectations over the (hidden) alignments
- use L-BFGS to maximize (or at least improve) the parameters λ
- repeat...
Initialize the model with a "reasonable" set of parameters:
- hand-tuned parameters for matching strings
- copy the match parameters to the non-match state and shrink them toward zero
69
Results
70
Plan for this week
- Edit distances
  - Distance(s,t) = cost of the best edit sequence that transforms s → t
  - Found via ...
- Learning edit distances: Ristad and Yianilos
  - Probabilistic generative model: pair HMMs
  - Learning now requires EM
  - Detour: EM for plain ol' HMMs
  - EM for pair HMMs
  - Discriminative learning for pair HMMs
- Why EM works
72
Jensen’s inequality…
73
Jensen's inequality…
[figure illustrating Jensen's inequality for the (concave) log function at points x1, x2, x3]
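The slides in this section are mostly figures; the bound they illustrate is the standard EM argument. For any distribution q over the hidden variable z, concavity of log (Jensen's inequality) gives a lower bound on the log-likelihood:

```latex
\log \Pr(x \mid \theta)
  = \log \sum_{z} \Pr(x, z \mid \theta)
  = \log \sum_{z} q(z)\,\frac{\Pr(x, z \mid \theta)}{q(z)}
  \;\ge\; \sum_{z} q(z)\,\log \frac{\Pr(x, z \mid \theta)}{q(z)}
```

The E step sets q(z) = Pr(z | x, θ_old), which makes the bound tight at the current parameters; the M step maximizes (or, in generalized EM, merely improves) the right-hand side over θ. Each iteration therefore cannot decrease log Pr(x | θ), which is why EM converges, though only to a local optimum.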
77
Comments
- Nice because we often know how to:
  - do learning in the model (if the hidden variables are known)
  - do inference in the model (to get the hidden variables)
  and that's all we need to do.
- Convergence: local, not global
- Generalized EM: E, but don't fully M; just improve the parameters