Kernels for Relation Extraction


1 Kernels for Relation Extraction
William Cohen

2 Outline for Today
Quick review: SVMs & kernels
Bunescu & Mooney's EMNLP 2005 paper:
  Hi-level representation (parsed text)
  Very simple path-based kernel
Quick review: edit distances
Bunescu & Mooney's NIPS 2006 paper:
  Shallower representation (NEs in context)
  Messier kernel based on string kernels
Bunescu & Mooney's ACL 2007 paper:
  Another nice relation extraction paper, not a kernel paper

3 Perceptrons vs SVMs

4 The voted perceptron
Given instance xi, compute the prediction: ŷi = vk . xi
If mistake (ŷi ≠ yi): vk+1 = vk + yi xi
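To make the update rule concrete, here is a minimal sketch (Python, not from the lecture) of the voted perceptron: each weight vector vk is stored with a count of how many examples it survived, and prediction is a survival-weighted vote.

```python
# Minimal sketch of the voted perceptron described above (illustrative code,
# not from the slides).
import numpy as np

def voted_perceptron(X, y, epochs=10):
    """X: (m, d) array of instances; y: labels in {-1, +1}."""
    v = np.zeros(X.shape[1])      # current weight vector v_k
    voted = []                    # list of (weight vector, survival count)
    c = 0                         # examples survived by v_k without a mistake
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = np.sign(v @ x_i) or 1.0       # predicted label
            if y_hat != y_i:                      # mistake: v_{k+1} = v_k + y_i x_i
                voted.append((v.copy(), c))
                v = v + y_i * x_i
                c = 1
            else:
                c += 1
    voted.append((v.copy(), c))
    return voted

def vote_predict(voted, x):
    # each stored perceptron votes, weighted by how long it survived
    s = sum(c * np.sign(v @ x) for v, c in voted)
    return 1.0 if s >= 0 else -1.0
```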

5

6 (3a) The guess v2 after the two positive examples: v2 = v1 + x2
(3b) The guess v2 after the one positive and one negative example: v2 = v1 - x2
[Figure: the vectors u, -u, v1, v2, +x1, +x2, -x2 for the two cases]

7 Perceptrons vs SVMs
For the voted perceptron to "work" (in this proof), we need to assume there is some u with ||u|| = 1 and some margin γ > 0 such that, for all i, u.xi.yi > γ

8 Perceptrons vs SVMs
Question: why not use this assumption directly in the learning algorithm? i.e.
Given: γ, (x1,y1), (x2,y2), (x3,y3), …
Find: some w where ||w|| = 1 and, for all i, w.xi.yi > γ

9 Perceptrons vs SVMs
Question: why not use this assumption directly in the learning algorithm? i.e.
Given: (x1,y1), (x2,y2), (x3,y3), …
Find: the best possible w and γ such that ||w|| = 1 and, for all i, w.xi.yi > γ

10 Perceptrons vs SVMs
Question: why not use this assumption directly in the learning algorithm? i.e.
Given: (x1,y1), (x2,y2), (x3,y3), …
Either: maximize γ under the constraints ||w|| = 1 and, for all i, w.xi.yi > γ
Or: minimize ||w||2 under the constraints: for all i, w.xi.yi > 1
Units are arbitrary: rescaling w rescales γ (setting w' = w/γ turns the margin-γ constraints into margin-1 constraints with ||w'|| = 1/γ), so maximizing γ is the same as minimizing ||w||2

11 Perceptrons vs SVMs
Basic optimization problem:
Given: (x1,y1), (x2,y2), (x3,y3), …
Minimize ||w||2 under the constraints: for all i, w.xi.yi > 1
Variant: ranking constraints (e.g., to model click-through feedback):
for all i and all j ≠ l: w.xi.yi,l > w.xi.yi,j + 1
But now you have exponentially many constraints…
…but Thorsten is a clever man: the cutting plane method can be used

12 Review of Kernels

13 The kernel perceptron
Given instance xi, compute the prediction: ŷi = vk . xi
If mistake (ŷi ≠ yi): vk+1 = vk + yi xi
Mathematically the same as before … but allows use of the kernel trick

14 The kernel perceptron
Given instance xi, compute the prediction: ŷi = vk . xi
If mistake (ŷi ≠ yi): vk+1 = vk + yi xi
Mathematically the same as before … but allows use of the "kernel trick"
Other kernel methods (SVM, Gaussian processes) aren't constrained to a limited set (+1/-1/0) of weights on the K(x,v) values.
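A sketch of the kernelized version (again illustrative, not the lecture's code): the weight vector is kept implicitly as the signed sum of the mistake examples, so training and prediction only ever touch the data through K.

```python
# Sketch of the kernel perceptron: v_k is never built explicitly; it is the
# sum of y_j x_j over the mistakes, so scores only need kernel evaluations.
def kernel_perceptron(X, y, K, epochs=10):
    """K(a, b) -> float. Returns the mistakes as (label y_j, example x_j) pairs."""
    support = []                                    # implicit v = sum of y_j phi(x_j)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            score = sum(y_j * K(x_j, x_i) for y_j, x_j in support)
            y_hat = 1.0 if score >= 0 else -1.0
            if y_hat != y_i:                        # mistake: add y_i x_i implicitly
                support.append((y_i, x_i))
    return support

def kp_predict(support, K, x):
    return 1.0 if sum(y_j * K(x_j, x) for y_j, x_j in support) >= 0 else -1.0
```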

15 Some common kernels
Linear kernel: K(x,y) = x . y
Polynomial kernel: K(x,y) = (x . y + 1)^d
Gaussian kernel: K(x,y) = exp(-||x - y||2 / 2σ2)
More later….
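In code, these three kernels look like this (a sketch in their standard textbook forms; the +c offset and the parameter names are assumptions, since the slide's own equations did not survive):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y                                   # K(x,y) = x . y

def polynomial_kernel(x, y, degree=3, c=1.0):
    return (x @ y + c) ** degree                   # K(x,y) = (x . y + c)^d

def gaussian_kernel(x, y, sigma=1.0):
    # K(x,y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
```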

16 Kernels 101
Duality
Gram matrix
Positive semi-definiteness and computational properties
Reproducing Kernel Hilbert Space (RKHS)
Closure properties

17 Kernels 101
Duality: two ways to look at this
  Implicitly map from x to φ(x) by changing the kernel function K
  Explicitly map from x to φ(x), i.e. to the point corresponding to x in the Hilbert space
Two different computational ways of getting the same behavior

18 Kernels 101
Duality
Gram matrix K: kij = K(xi,xj)
K(x,x’) = K(x’,x) ⇒ the Gram matrix is symmetric
K(x,x) > 0 ⇒ the diagonal of K is positive
K is "positive semi-definite" ⇒ zT K z ≥ 0 for all z

19 Kernels 101
Duality
Gram matrix K: kij = K(xi,xj)
K(x,x’) = K(x’,x) ⇒ the Gram matrix is symmetric
K(x,x) > 0 ⇒ the diagonal of K is positive
K is "positive semi-definite" ⇒ zT K z ≥ 0 for all z
Fun fact: Gram matrix positive semi-definite ⇔ K(xi,xj) = ⟨φ(xi), φ(xj)⟩ for some φ
Proof: φ(x) uses the eigenvectors of K to represent x
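A small numeric sanity check of the "fun fact" (a sketch, not from the slides): build a Gram matrix with a polynomial kernel, verify symmetry and positive semi-definiteness, and read an explicit φ off the eigendecomposition so that the inner products reproduce K.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = np.array([[(x @ y + 1.0) ** 2 for y in X] for x in X])   # polynomial kernel

assert np.allclose(K, K.T)                   # K(x,x') = K(x',x): symmetric
w, V = np.linalg.eigh(K)
assert w.min() > -1e-9                       # positive semi-definite

# phi(x_i): coordinates of x_i in the eigenbasis, scaled by sqrt(eigenvalues)
Phi = V * np.sqrt(np.clip(w, 0.0, None))     # row i is phi(x_i)
assert np.allclose(Phi @ Phi.T, K)           # <phi(x_i), phi(x_j)> = K(x_i, x_j)
```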

20 Kernels 101
Duality
Gram matrix
Positive semi-definite
Closure properties: if K and K' are kernels, then
  K + K' is a kernel
  cK is a kernel, if c > 0
  aK + bK' is a kernel, for a,b > 0

21 Extracting Relationships with Kernels

22 What is “Information Extraction”
As a task: filling slots in a database from sub-segments of text.
23rd July :51 GMT
Microsoft was in violation of the GPL (General Public License) on the Hyper-V code it released to open source this week. After Redmond covered itself in glory by opening up the code, it now looks like it may have acted simply to head off any potentially embarrassing legal dispute over violation of the GPL. The rest was theater. As revealed by Stephen Hemminger - a principal engineer with open-source network vendor Vyatta - a network driver in Microsoft's Hyper-V used open-source components licensed under the GPL and statically linked to binary parts. The GPL does not permit the mixing of closed and open-source elements. … Hemminger said he uncovered the apparent violation and contacted Linux Driver Project lead Greg Kroah-Hartman, a Novell programmer, to resolve the problem quietly with Microsoft. Hemminger apparently hoped to leverage Novell's interoperability relationship with Microsoft.
Extracted slots:
NAME               | TITLE              | ORGANIZATION
Stephen Hemminger  | principal engineer | Vyatta
Greg Kroah-Hartman | programmer         | Novell
Greg Kroah-Hartman | lead               | Linux Driver Proj.
What is IE? As a task it is: starting with some text and an empty database with a defined ontology of fields and records, use the information in the text to fill the database.

23 What is “Information Extraction”
Techniques: NER + Segment + Classify entity pairs from the same segment
23rd July :51 GMT
Hemminger said he uncovered the apparent violation and contacted Linux Driver Project lead Greg Kroah-Hartman, a Novell programmer, to resolve the problem quietly with Microsoft. Hemminger apparently hoped to leverage Novell's interoperability relationship with Microsoft.
(Highlighted entity mentions: Hemminger, Linux Driver Project, lead, Greg Kroah-Hartman, programmer, Novell, Microsoft)
One-stage process: classify (E1,E2) as unrelated or employedBy, employerOf, hasTitle, titleOf, hasPosition, positionInCompany
Two-stage process: classify (E1,E2) as related or not; then classify the related (E1,E2) pairs as one of the above

24 Bunescu & Mooney’s papers

25 Kernels vs Structured Output Spaces
Two kinds of structured prediction:
HMMs, CRFs, VP-trained HMMs, structured SVMs, stacked learning, …: the output of the learner is structured (e.g., for a linear-chain CRF, the output is a sequence of labels, a string in Yn)
Bunescu & Mooney (EMNLP, NIPS): the input to the learner is structured (EMNLP: structure derived from a dependency graph). New!

26 Tasks: ACE relations

27 Dependency graphs for sentences
[Figure: dependency graph for the example sentence "Protesters seized several pumping stations, holding 127 Shell workers hostage."]

28 Dependency graphs for sentences
CFG dependency parsers → dependency trees
Context-sensitive formalisms → dependency DAGs

29

30 Disclaimer: this is a shortest path, not the shortest path

31 K( x1 × … × xn, y1 × … × yn ) = | ( x1 × … × xn ) ∩ ( y1 × … × yn ) |
Each relation instance is x1 × … × xn, where each xi is a set of features for position i on the path, e.g. |x1| * |x2| * |x3| * |x4| * |x5| = 4*1*3*1*4 = 48 features.
The kernel counts the features common to the two instances, which equals the product of the per-position overlaps |xi ∩ yi| (and is 0 if the paths have different lengths).
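As code, this kernel is just a product of set intersections; a sketch with made-up feature sets (the words echo the earlier dependency-graph example, but the feature values are illustrative):

```python
# Dependency-path kernel: 0 for paths of different lengths, otherwise the
# product over positions of the number of shared features |x_i ∩ y_i|.
def path_kernel(x, y):
    if len(x) != len(y):
        return 0
    k = 1
    for xi, yi in zip(x, y):
        k *= len(xi & yi)
    return k

# Illustrative 5-position paths:
x = [{"protesters", "NNS", "Noun", "PERSON"}, {"->"},
     {"seized", "VBD", "Verb"}, {"<-"},
     {"stations", "NNS", "Noun", "FACILITY"}]
y = [{"workers", "NNS", "Noun", "PERSON"}, {"->"},
     {"seized", "VBD", "Verb"}, {"<-"},
     {"stations", "NNS", "Noun", "FACILITY"}]
print(path_kernel(x, y))   # 3 * 1 * 3 * 1 * 4 = 36 common features
```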

32 Results
-CCG, -CFG: context-sensitive CCG vs Collins' (CFG) parser
S1, S2: one multi-class SVM vs two SVMs (binary, then multiclass)
K4 is the baseline (two-stage SVM, custom kernel)
Correct entity output is assumed

33 Now we come back to… edit distances

34 String distance metrics: Levenshtein
Edit-distance metrics for pairs of strings s,t: the distance is the shortest sequence of edit commands that transform s to t.
Simplest set of operations:
  Copy a character from s over to t
  Delete a character in s (cost 1)
  Insert a character in t (cost 1)
  Substitute one character for another (cost 1)
This is "Levenshtein distance"

35 Levenshtein distance - example
distance("William Cohen", "Willliam Cohon") = 2
[Alignment table: the extra "L" in "Willliam" aligns to a gap (cost 1) and the "E"/"O" pair is a substitution (cost 1); all other characters are copies.]

36 Computing Levenshtein distance – 4
D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)   // subst/copy, where d(si,tj) = 0 if si = tj, else 1
  D(i-1,j) + 1            // insert
  D(i,j-1) + 1            // delete
[Partial DP table for a row ending in "M" against "C O H E N": 1 2 3 4 5]
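The recurrence translates directly into a small dynamic program (a sketch with the unit costs listed above; copy is free):

```python
def levenshtein(s, t):
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                                  # delete everything in s
    for j in range(n + 1):
        D[0][j] = j                                  # insert everything in t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1   # d(s_i, t_j)
            D[i][j] = min(D[i - 1][j - 1] + sub,     # subst / copy
                          D[i - 1][j] + 1,           # insert
                          D[i][j - 1] + 1)           # delete
    return D[m][n]

print(levenshtein("William Cohen", "Willliam Cohon"))   # 2
```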

37 Now the NIPS paper
Similar representation for relation instances: x1 × … × xn where each xi is a set…
…but instead of highly informative dependency path elements, the x's just represent adjacent tokens.
To compensate: use a richer kernel

38 Motivation
Rules for protein-protein interaction like "interaction of (gap0-3) <Protein1> with (gap0-3) <Protein2>" were used by a prior rule-based system.
Add the ability to match features of words (e.g., POS tags)
Add constraints: match words before&between, between, or between&after the two proteins

39 Subsequence kernel
Feature space: the set of all sparse subsequences u of x1 × … × xn, with each u downweighted according to its sparsity.
Relaxation of the old kernel: we don't have to match everywhere, just at selected locations; for every position spanned by our matching pattern, we get a penalty of λ.
To pick a "feature" inside (x1 … xn)':
  Pick a subset of locations i = i1,…,ik, and then
  Pick a feature value in each location
In the preprocessed vector x', weight every feature for i by λ^length(i) = λ^(ik-i1+1)

40 Subsequence kernel w/cost c(x,y)
Only counts u that align with last char of s and t

41 Dynamic programming computation
Kn(s,t): number of matches between s and t of size n
K'n(s,t): number of matches between s and t of size n, scored as if the final position matched (i.e., the recursion "remembers" that "there is a match to the right")
K''n(s,t): number of matches between s and t that match the last char of s to something (i.e., the recursion "remembers" that "the final char of s matches")
The recursions split on skipping vs. including position i of s, and on whether the final position of s is matched or not.
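A hedged sketch of a gap-weighted subsequence kernel in this spirit. It follows the standard Lodhi-style string-kernel recursion, which may differ in detail from the exact K, K', K'' bookkeeping in the paper: every common subsequence of length n counts, downweighted by λ to the power of the total span it covers in s and in t.

```python
def subsequence_kernel(s, t, n, lam):
    """Count common subsequences of length n in s and t, each weighted by
    lam ** (span in s + span in t)."""
    # Kp[i][p][q] plays the role of K'_i(s[:p], t[:q])
    Kp = [[[0.0] * (len(t) + 1) for _ in range(len(s) + 1)] for _ in range(n)]
    for p in range(len(s) + 1):
        for q in range(len(t) + 1):
            Kp[0][p][q] = 1.0
    for i in range(1, n):
        for p in range(i, len(s) + 1):
            Kpp = 0.0                     # K''_i(s[:p], t[:q]), built up over q
            for q in range(i, len(t) + 1):
                if s[p - 1] == t[q - 1]:  # last chars match
                    Kpp = lam * (Kpp + lam * Kp[i - 1][p - 1][q - 1])
                else:                     # skip position q of t, pay lam
                    Kpp = lam * Kpp
                Kp[i][p][q] = lam * Kp[i][p - 1][q] + Kpp
    k = 0.0                               # finally require a real match at the end
    for p in range(n, len(s) + 1):
        for q in range(n, len(t) + 1):
            if s[p - 1] == t[q - 1]:
                k += lam * lam * Kp[n - 1][p - 1][q - 1]
    return k

print(subsequence_kernel("cat", "cart", 2, 0.5))
# = lam^4 ("ca") + lam^5 ("at") + lam^7 ("ct") = 0.1015625
```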

42 Additional details
Special domain-specific tricks for combining the subsequences for what matches in the fore, aft, and between sections of a relation-instance pair.
Subsequences are of length less than 4. Is DP needed for this now?
Count fore-between, between-aft, and between subsequences separately.

43 Results Protein-protein interaction ERK-A: no fore/aft sequences

44 Results

45 And now a further extension…
Multiple instance learning: an instance is a bag ({x1,…,xn}, y) where each xi is a vector of features, and
  if y is positive, some of the xi's have a positive label;
  if y is negative, none of the xi's have a positive label.
Approaches: EM, SVM techniques
Their approach: treat all xi's in positive bags as positive examples, but downweight the cost of misclassifying them.
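A minimal sketch of that downweighting idea using an off-the-shelf SVM and per-example costs (illustrative only: the cp weighting and the bag format here are assumptions, not the paper's exact objective):

```python
import numpy as np
from sklearn.svm import SVC

def fit_mil_svm(pos_bags, neg_bags, cp=0.1):
    """pos_bags/neg_bags: lists of bags, each bag a list of feature vectors."""
    X, y, w = [], [], []
    for bag in pos_bags:          # only *some* of these are truly positive,
        for x in bag:             # so their misclassification cost is scaled down
            X.append(x); y.append(+1); w.append(cp)
    for bag in neg_bags:          # all instances in negative bags are negative
        for x in bag:
            X.append(x); y.append(-1); w.append(1.0)
    clf = SVC(kernel="linear")
    clf.fit(np.array(X), np.array(y), sample_weight=np.array(w))
    return clf
```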

46

47 [SVM objective for the multiple-instance setting, with annotations:]
Lp = total size of positive bags; Ln = total size of negative bags
cp < 0.5 is a parameter
(Other callouts label the intercept term and the slack variables.)

48 Datasets
Collected with Google search queries, then sentence-segmented.
This is terrible data, since there are lots of spurious correlations with Google, Adobe, …

49 Datasets
Fix: downweight words in patterns u if they are strongly correlated with particular bags (e.g., the Google/YouTube bag).

50 Results

51 General comments
If the input is structured, what do you do?
  Custom algorithm
  Design features, and use YFCL (your favorite classification learner)
  Use a kernel that exploits the structure, and use YFCL, as long as it's kernel-based
    You may need to explicitly compute K
If you need a new kernel:
  Can combine existing kernels
  Can adapt existing kernels

52 General comments
If the input is structured, what do you do?
Use a kernel that exploits the structure, and use YFCL, as long as it's kernel-based.
Kernels exist for:
  Strings (e.g., equivalent to a feature for each matching substring, weighted by length); edit distances
  Trees (equivalent to a feature for each matching subtree, ….)
  Weighted transducers (rational kernels), e.g. word lattices for speech
  Parameter sensitivity under a generative model (Fisher kernel)

