
1 Relation Extraction William Cohen 10-18

2 Kernels vs Structured Output Spaces

Two kinds of structured learning:
– HMMs, CRFs, VP-trained HMMs, structured SVMs, stacked learning, …: the output of the learner is structured. E.g., for a linear-chain CRF the output is a sequence of labels, a string in Y^n.
– Bunescu & Mooney (EMNLP, NIPS): the input to the learner is structured. EMNLP: structure derived from a dependency graph. New!

3

4 x1 × x2 × x3 × x4 × x5 = 4·1·3·1·4 = 48 features

K(x1 × … × xn, y1 × … × yn) = (x1 × … × xn) ∩ (y1 × … × yn), i.e., the number of features the two instances share, under the feature map x → H(x)
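The counting on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the feature values in `x` below are hypothetical stand-ins, chosen only so the per-position set sizes are 4, 1, 3, 1, 4 as on the slide.

```python
def n_features(x):
    """Size of the cross product x1 x ... x xn."""
    total = 1
    for xi in x:
        total *= len(xi)
    return total

def K(x, y):
    """Number of features shared by two instances: the cross products
    factor per position, so this is the product of |xi intersect yi|."""
    assert len(x) == len(y)
    common = 1
    for xi, yi in zip(x, y):
        common *= len(xi & yi)
    return common

# Matches the slide's arithmetic: 4 * 1 * 3 * 1 * 4 = 48 features.
x = [{"protesters", "NNS", "noun", "PERSON"},
     {"stormed"},
     {"stormed", "VBD", "verb"},
     {"stations"},
     {"stations", "NNS", "noun", "FACILITY"}]
print(n_features(x))  # 48

# A second (hypothetical) instance differing only in position 1:
y = [{"police", "NNS", "noun", "PERSON"}] + x[1:]
print(K(x, y))  # 3 * 1 * 3 * 1 * 4 = 36
```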

5 …and the NIPS paper

Similar representation for relation instances: x1 × … × xn, where each xi is a set…
…but instead of informative dependency-path elements, the x's just represent adjacent tokens. To compensate: use a richer kernel.

6 Background: edit distances

7 Levenshtein distance - example

distance("William Cohen", "Willliam Cohon")

  s:    W I L L - I A M _ C O H E N
  t:    W I L L L I A M _ C O H O N
  op:   C C C C I C C C C C C C S C
  cost: 0 0 0 0 1 1 1 1 1 1 1 1 2 2

(alignment: C = copy, I = insert, S = substitute; the inserted L is a gap in s)


9 Computing Levenshtein distance - 1

D(i,j) = score of best alignment from s1..si to t1..tj
       = min( D(i-1,j-1)    if si = tj   // copy
              D(i-1,j-1)+1  if si != tj  // substitute
              D(i-1,j)+1                 // insert
              D(i,j-1)+1 )               // delete

10 Computing Levenshtein distance - 2

D(i,j) = score of best alignment from s1..si to t1..tj
       = min( D(i-1,j-1) + d(si,tj)  // subst/copy
              D(i-1,j)+1             // insert
              D(i,j-1)+1 )           // delete

(simplify by letting d(c,d) = 0 if c = d, 1 otherwise)
Also let D(i,0) = i (for i inserts) and D(0,j) = j.
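The recurrence above translates directly into a small dynamic program. A straightforward sketch (names and test strings are mine):

```python
def levenshtein(s, t):
    """Edit distance via the slide's recurrence:
    D(i,0) = i, D(0,j) = j, and
    D(i,j) = min(D(i-1,j-1) + d(si,tj),  # subst/copy
                 D(i-1,j) + 1,           # insert
                 D(i,j-1) + 1)           # delete
    with d(c,d) = 0 if c == d else 1."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + d,
                          D[i - 1][j] + 1,
                          D[i][j - 1] + 1)
    return D[m][n]

print(levenshtein("WILLIAM_COHEN", "WILLLIAM_COHON"))  # 2
print(levenshtein("MCCOHN", "COHEN"))                  # 3, as in the table slides
```

The full table is computed (not just the last cell), so a traceback over it could recover the edit operations and a best alignment, as the later slides describe.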

11 Computing Levenshtein distance - 3

D(i,j) = min( D(i-1,j-1) + d(si,tj)  // subst/copy
              D(i-1,j)+1             // insert
              D(i,j-1)+1 )           // delete

        C  O  H  E  N
   M    1  2  3  4  5
   C    1  2  3  4  5
   C    2  2  3  4  5
   O    3  2  3  4  5
   H    4  3  2  3  4
   N    5  4  3  3  3   = D(s,t)

One best alignment: M~_, C~_, C~C, O~O, H~H, _~E, N~N


13 Computing Levenshtein distance - 4

D(i,j) = min( D(i-1,j-1) + d(si,tj)  // subst/copy
              D(i-1,j)+1             // insert
              D(i,j-1)+1 )           // delete

        C  O  H  E  N
   M    1  2  3  4  5
   C    1  2  3  4  5
   C    2  2  3  4  5
   O    3  2  3  4  5
   H    4  3  2  3  4
   N    5  4  3  3  3

A trace indicates where the min value came from, and can be used to find the edit operations and/or a best alignment (there may be more than one).


15 Affine gap distances

Levenshtein fails on some pairs that seem quite similar:
  William W. Cohen
  William W. ‘Don’t call me Dubya’ Cohen

16 Affine gap distances - 2

Idea:
– Current cost of a “gap” of n characters: nG
– Make this cost A + (n-1)B, where A is the cost of “opening” a gap and B is the cost of “continuing” a gap.

17 Computing Levenshtein distance - variant

Similarity (max) version:
D(i,j) = score of best alignment from s1..si to t1..tj
       = max( D(i-1,j-1) + d(si,tj)  // subst/copy
              D(i-1,j)-1             // insert
              D(i,j-1)-1 )           // delete
with d(x,x) = 2 and d(x,y) = -1 if x != y

Distance (min) version:
D(i,j) = min( D(i-1,j-1) + d(si,tj)  // subst/copy
              D(i-1,j)+1             // insert
              D(i,j-1)+1 )           // delete
with d(x,x) = 0 and d(x,y) = 1 if x != y

18 Affine gap distances - 3

D(i,j) = max( D(i-1,j-1) + d(si,tj)
              IS(i-1,j-1) + d(si,tj)
              IT(i-1,j-1) + d(si,tj) )

IS(i,j) = max( D(i-1,j) - A      // best score in which si
               IS(i-1,j) - B )   // is aligned with a ‘gap’

IT(i,j) = max( D(i,j-1) - A      // best score in which tj
               IT(i,j-1) - B )   // is aligned with a ‘gap’
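These three recurrences translate directly into a Gotoh-style dynamic program. A sketch under assumed illustrative costs (gap-open A=2, gap-continue B=1, match +2, mismatch -1; none of these values come from the slides):

```python
NEG = float("-inf")

def affine_gap_score(s, t, A=2.0, B=1.0, match=2.0, mismatch=-1.0):
    """Best alignment score with affine gaps, per the slide's recurrences:
      D(i,j)  = d(si,tj) + max(D(i-1,j-1), IS(i-1,j-1), IT(i-1,j-1))
      IS(i,j) = max(D(i-1,j) - A, IS(i-1,j) - B)   # si aligned with a gap
      IT(i,j) = max(D(i,j-1) - A, IT(i,j-1) - B)   # tj aligned with a gap
    A = gap-open cost, B = gap-continue cost (scores, so subtracted)."""
    m, n = len(s), len(t)
    D  = [[NEG] * (n + 1) for _ in range(m + 1)]
    IS = [[NEG] * (n + 1) for _ in range(m + 1)]
    IT = [[NEG] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):            # leading gap in t
        IS[i][0] = -A - (i - 1) * B
    for j in range(1, n + 1):            # leading gap in s
        IT[0][j] = -A - (j - 1) * B
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j]  = d + max(D[i-1][j-1], IS[i-1][j-1], IT[i-1][j-1])
            IS[i][j] = max(D[i-1][j] - A, IS[i-1][j] - B)
            IT[i][j] = max(D[i][j-1] - A, IT[i][j-1] - B)
    return max(D[m][n], IS[m][n], IT[m][n])

print(affine_gap_score("AXXB", "AB"))  # 2 matches - open - continue = 1.0
```

With these recurrences, the long parenthetical in "William W. ‘Don’t call me Dubya’ Cohen" costs one gap-open plus cheap extensions, rather than a full per-character penalty.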

19 Affine gap distances - 4

(figure: three-state diagram over D, IS, and IT; entering IS or IT costs -A, staying in them costs -B, and transitions into D add d(si,tj))

20 Back to subsequence kernels

21 Subsequence kernel

Feature space: the set of all sparse subsequences u of x1 × … × xn, with each u downweighted according to its sparsity.

Relaxation of the old kernel:
1. We don't have to match everywhere, just at selected locations
2. For every position in the pattern, we get a penalty of λ

To pick a “feature” inside (x1 … xn)':
1. Pick a subset of locations i = i1,…,ik, and then
2. Pick a feature value in each location
3. In the preprocessed vector x', weight every feature for i by λ^length(i) = λ^(ik-i1+1)

22 Subsequence kernel

(slide shows the kernel formula, written in two equivalent forms)

23 Example

1-Nop7 2-binds 3-readily 4-to 5-the 6-ribosomal 7-protein 8-YTM1
1-Erb1 2-binds 3-to 4-YTM1


25 Subsequence kernels w/o features

Example strings:
– “Elvis Presley was born on Jan 8” → s1) PERSON was born on DATE.
– “William Cohen was born in New York City on April 6” → s2) PERSON was born in LOCATION on DATE.
Plausible pattern:
– PERSON was born … on DATE.
What we'll actually learn:
– u = PERSON … was … born … on … DATE.
– u matches s if there exist increasing indices i = i1,…,in in s such that s[i] = s[i1]…s[in] = u
– For string s1, i = 1,2,3,4,5. For string s2, i = 1,2,3,6,7.
[Lodhi et al., JMLR 2002]

26 Subsequence kernels w/o features

s1) PERSON was born on DATE.
s2) PERSON was born in LOCATION on DATE.
Pattern:
– u = PERSON … was … born … on … DATE.
– u matches s if there exist increasing indices i = i1,…,in with s[i] = s[i1]…s[in] = u
– For string s1, i = 1,2,3,4,5. For string s2, i = 1,2,3,6,7.
How do we say that s1 matches better than s2?
– Weight a match of s to u by λ^length(i), where length(i) = in - i1 + 1
Now let's define K(s,t) = the sum, over all u that match both s and t, of matchWeight(u,s) * matchWeight(u,t)
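A brute-force version of this weighting, small enough to check by hand (λ = 0.5 is an arbitrary illustrative value, and the function names are mine, not the paper's):

```python
from itertools import combinations

LAMBDA = 0.5  # illustrative decay; the slides leave lambda unspecified

def match_weights(s, n):
    """Map each length-n (gappy) subsequence u of s to its total match
    weight: the sum of LAMBDA^(i_n - i_1 + 1) over every increasing
    index sequence i with s[i] = u."""
    w = {}
    for i in combinations(range(len(s)), n):
        u = tuple(s[k] for k in i)
        w[u] = w.get(u, 0.0) + LAMBDA ** (i[-1] - i[0] + 1)
    return w

def subseq_kernel_bruteforce(s, t, n):
    """K(s,t) = sum over all u of matchWeight(u,s) * matchWeight(u,t)."""
    ws, wt = match_weights(s, n), match_weights(t, n)
    return sum(w * wt[u] for u, w in ws.items() if u in wt)

s1 = "PERSON was born on DATE .".split()
s2 = "PERSON was born in LOCATION on DATE .".split()
u = ("PERSON", "was", "born", "on", "DATE")
# u matches s1 at i = 1..5 (weight lambda^5) and s2 at i = 1,2,3,6,7
# (weight lambda^7): s1 is the tighter, higher-weighted match.
```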

27 K'_i(s,t) = “we're paying the λ penalty now”: the weighted count of patterns u of length i that match both s and t, where the pattern is required to extend to the end of s. These recursions allow dynamic programming.
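The recursions can be sketched as a memoized dynamic program in the style of Lodhi et al. (JMLR 2002); this is a reconstruction from the standard formulation, not the slide's exact notation:

```python
from functools import lru_cache

def subseq_kernel_dp(s, t, n, lam):
    """String subsequence kernel via the K'_i recursions.  K'_i "pays
    the lambda penalty now": it scores length-i matches under the
    assumption that the pattern extends to the end of (the prefix of) s."""

    @lru_cache(maxsize=None)
    def Kp(i, ls, lt):
        # K'_i evaluated on the prefixes s[:ls] and t[:lt]
        if i == 0:
            return 1.0
        if min(ls, lt) < i:
            return 0.0
        x = s[ls - 1]
        # either the last symbol of s is skipped (pay lambda anyway)...
        total = lam * Kp(i, ls - 1, lt)
        # ...or it is matched against an occurrence of x in t
        for j in range(lt):
            if t[j] == x:
                total += Kp(i - 1, ls - 1, j) * lam ** (lt - j + 1)
        return total

    @lru_cache(maxsize=None)
    def K(ls, lt):
        if min(ls, lt) < n:
            return 0.0
        x = s[ls - 1]
        total = K(ls - 1, lt)
        for j in range(lt):
            if t[j] == x:
                total += Kp(n - 1, ls - 1, j) * lam ** 2
        return total

    return K(len(s), len(t))

# With lambda = 1, this simply counts occurrence pairs of common
# length-n subsequences:
print(subseq_kernel_dp("abc", "abc", 2, 1.0))  # 3.0 ("ab", "ac", "bc")
```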

28 Subsequence kernel with features

Feature space: the set of all sparse subsequences u of x1 × … × xn, with each u downweighted according to its sparsity.

Relaxation of the old kernel:
1. We don't have to match everywhere, just at selected locations
2. For every position we decide to match at, we get a penalty of λ

To pick a “feature” inside (x1 … xn)':
1. Pick a subset of locations i = i1,…,ik, and then
2. Pick a feature value in each location
3. In the preprocessed vector x', weight every feature for i by λ^length(i) = λ^(ik-i1+1)

29 Subsequence kernel w/ features

Same kernel as before, except the 0/1 match test is replaced by c(x,y) = the number of ways x and y match (i.e., the number of common features).
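A brute-force sketch of this featurized variant, under my reading that each position is a feature set and the 0/1 equality test becomes c(x,y) = |x ∩ y|; the word/POS feature sets below are hypothetical:

```python
from itertools import combinations

def c(x, y):
    """Number of ways positions x and y match = number of common features."""
    return len(x & y)

def featurized_kernel(s, t, n, lam):
    """Brute-force featurized subsequence kernel: s and t are sequences
    of feature sets, and every pair of index sequences i (in s) and j
    (in t) contributes
        prod_k c(s[i_k], t[j_k]) * lam^(length(i) + length(j))."""
    total = 0.0
    for i in combinations(range(len(s)), n):
        for j in combinations(range(len(t)), n):
            ways = 1
            for a, b in zip(i, j):
                ways *= c(s[a], t[b])
                if ways == 0:
                    break
            if ways:
                total += ways * lam ** ((i[-1] - i[0] + 1) + (j[-1] - j[0] + 1))
    return total

# Hypothetical word+POS feature sets for the running example:
s = [{"Nop7", "NOUN"}, {"binds", "VERB"}, {"to", "ADP"}, {"YTM1", "NOUN"}]
t = [{"Erb1", "NOUN"}, {"binds", "VERB"}, {"to", "ADP"}, {"YTM1", "NOUN"}]
```

Note that "Nop7" and "Erb1" no longer have to be the same token to contribute: they still match in one way, through the shared NOUN feature.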

30

31 (formula annotation: the sum ranges over all j, each term weighted by c(x, t[j]) = the number of ways x and t[j] match, i.e., their number of common features)


33 Additional details

Special domain-specific tricks for combining the subsequences for what matches in the fore, between, and aft sections of a relation-instance pair:
– Subsequences are of length less than 4. Is DP needed for this now?
– Count fore-between, between-aft, and between subsequences separately.

34 Results Protein-protein interaction

35 And now a further extension…

Suppose we don't have annotated data, but we do know which proteins interact.
– This is actually pretty reasonable.
We can find examples of sentences with p1, p2 that don't interact, and be pretty sure they are negative.
We can find example strings for interacting p1, p2, e.g. “phosphorylates”, but we can't be sure they are all positive examples of a relation.

36 And now a further extension…

Multiple instance learning:
– An instance is a bag ({x1,…,xn}, y), where each xi is a vector of features, and:
  - if y is positive, some of the xi's have a positive label
  - if y is negative, none of the xi's have a positive label
– Approaches: EM, SVM techniques
– Their approach: treat all xi's as positive examples, but downweight the cost of misclassifying them.
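A toy sketch of the downweighting idea. This substitutes plain subgradient descent on a cost-weighted hinge loss for the actual SVM solver; the parameter `cp` mirrors the slides' c_p < 0.5, and everything else (learning rate, epochs, data) is made up for illustration:

```python
import random

def train_weighted_hinge(bags, dim, cp=0.25, lr=0.01, epochs=300, C=1.0):
    """bags: list of (instances, bag_label).  Every instance from a
    positive bag is treated as positive, but its misclassification cost
    is scaled down by cp < 0.5, since some of them are false positives."""
    data = []
    for instances, y in bags:
        cost = cp if y > 0 else 1.0 - cp
        label = 1 if y > 0 else -1
        for x in instances:
            data.append((x, label, cost))
    w = [0.0] * dim
    b = 0.0
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y, cost in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i in range(dim):       # L2 regularization step
                w[i] -= lr * w[i]
            if margin < 1:             # cost-weighted hinge subgradient
                for i in range(dim):
                    w[i] += lr * C * cost * y * x[i]
                b += lr * C * cost * y
    return w, b

# Toy data: the positive bag holds one true positive and one noise
# instance that looks just like the negatives.
pos_bag = ([[1.0, 0.0], [0.0, 1.0]], 1)
neg_bag = ([[0.0, 1.0], [0.0, 0.9]], 0)
w, b = train_weighted_hinge([pos_bag, neg_bag], dim=2)
```

Because the noisy positive's cost is downweighted, the negatives dominate on the second feature, and the learned weights favor the true-positive direction.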

37

38 (SVM objective, annotated: intercept term; slack variables; Lp = total size of positive bags; Ln = total size of negative bags; cp < 0.5 is a parameter)

39 Datasets

Collected with Google search queries, then sentence-segmented. This is terrible data, since there are lots of spurious correlations with Google, Adobe, …

40 Datasets

Fix: downweight words in patterns u if they are strongly correlated with particular bags (e.g. the Google/YouTube bag).

41 Results

