Relation Extraction William Cohen 10-18

Kernels vs Structured Output Spaces
Two kinds of structured learning:
– HMMs, CRFs, VP-trained HMMs, structured SVMs, stacked learning, …: the output of the learner is structured. E.g., for a linear-chain CRF, the output is a sequence of labels, i.e. a string in Y^n.
– Bunescu & Mooney (EMNLP, NIPS): the input to the learner is structured. EMNLP: structure derived from a dependency graph. New!

A relation instance is represented as x = x_1 × x_2 × x_3 × x_4 × x_5, where each x_i is a set of features for that position. Here |x_1| * |x_2| * |x_3| * |x_4| * |x_5| = 4*1*3*1*4 = 48 features.
The kernel counts the features two instances share:
K(x_1 × … × x_n, y_1 × … × y_n) = |(x_1 × … × x_n) ∩ (y_1 × … × y_n)|
under the implicit feature map x → H(x).
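
Below is a minimal sketch of this counting kernel, assuming each position x_i is a Python set of features; path_kernel and the example feature sets are illustrative, not from the paper:

```python
def path_kernel(x, y):
    """Counting kernel for two relation instances represented as sequences of
    feature sets x_1 x ... x x_n: the number of features shared by the two
    cross-products, which factors into a product of per-position overlaps.
    Instances of different length share no features, so the kernel is 0."""
    if len(x) != len(y):
        return 0
    k = 1
    for xi, yi in zip(x, y):
        k *= len(xi & yi)   # common elements at this position
    return k

# Hypothetical instance with position sizes 4, 1, 3, 1, 4, matching the count
# on the slide: 4*1*3*1*4 = 48 features, so path_kernel(x, x) == 48.
x = [{"protesters", "NNS", "Noun", "PERSON"},
     {"->"},
     {"seized", "VBD", "Verb"},
     {"<-"},
     {"stations", "NNS", "Noun", "FACILITY"}]
assert path_kernel(x, x) == 48
```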

And the NIPS paper…
Similar representation for relation instances: x_1 × … × x_n, where each x_i is a set…
…but instead of informative dependency-path elements, the x_i's just represent adjacent tokens.
To compensate: use a richer kernel.

Background: edit distances

Levenshtein distance - example
distance("William Cohen", "Willliam Cohon")

s:    WILL-IAM_COHEN
t:    WILLLIAM_COHON
op:   CCCCICCCCCCCSC

Each op is a copy (C), insert (I, here a gap in s), or substitute (S); the cost of the alignment is the number of non-copy ops, here 2.

Computing Levenshtein distance - 1
D(i,j) = score of best alignment from s1..si to t1..tj
       = min( D(i-1,j-1)     if si=tj    //copy
              D(i-1,j-1)+1   if si!=tj   //substitute
              D(i-1,j)+1                 //insert
              D(i,j-1)+1 )               //delete

Computing Levenshtein distance - 2
D(i,j) = score of best alignment from s1..si to t1..tj
       = min( D(i-1,j-1) + d(si,tj)   //subst/copy
              D(i-1,j)+1              //insert
              D(i,j-1)+1 )            //delete
(simplify by letting d(c,d) = 0 if c=d, 1 otherwise)
Also let D(i,0) = i (for i inserts) and D(0,j) = j.

Computing Levenshtein distance - 3
D(i,j) = min( D(i-1,j-1) + d(si,tj)   //subst/copy
              D(i-1,j)+1              //insert
              D(i,j-1)+1 )            //delete

         C  O  H  E  N
    M    1  2  3  4  5
    C    1  2  3  4  5
    C    2  2  3  4  5
    O    3  2  3  4  5
    H    4  3  2  3  4
    N    5  4  3  3  3   <- bottom-right cell = D(s,t) = 3

One best alignment: M~_  C~_  C~C  O~O  H~H  _~E  N~N
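
A minimal sketch of this recurrence (the function name is mine); filling the table for s = MCCOHN, t = COHEN reproduces the matrix above:

```python
def levenshtein_matrix(s, t):
    """Fill the D(i,j) table from the recurrence above; D[len(s)][len(t)] is
    the edit distance. levenshtein_matrix("MCCOHN", "COHEN") reproduces the
    matrix on this slide, with D(s,t) = 3 in the bottom-right cell."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + d,   # subst/copy
                          D[i - 1][j] + 1,       # "insert" (s[i] vs a gap)
                          D[i][j - 1] + 1)       # "delete" (a gap vs t[j])
    return D
```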


Computing Levenshtein distance - 4
D(i,j) = min( D(i-1,j-1) + d(si,tj)   //subst/copy
              D(i-1,j)+1              //insert
              D(i,j-1)+1 )            //delete
[Same matrix as above for s = MCCOHN, t = COHEN, with trace pointers drawn in.]
A trace indicates where the min value came from, and can be used to find the edit operations and/or a best alignment (there may be more than one).
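
A sketch of the trace step, assuming the D matrix from the previous sketch: it re-derives which case achieved the min at each cell and emits one best alignment.

```python
def traceback(D, s, t):
    """Recover one best alignment from a filled D matrix (see previous sketch)
    by re-checking which case achieved the min at each cell."""
    ops = []
    i, j = len(s), len(t)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (0 if s[i-1] == t[j-1] else 1):
            ops.append(("copy" if s[i-1] == t[j-1] else "subst", s[i-1], t[j-1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i-1][j] + 1:
            ops.append(("gap_in_t", s[i-1], "-"))   # s[i] aligned to a gap
            i -= 1
        else:
            ops.append(("gap_in_s", "-", t[j-1]))   # t[j] aligned to a gap
            j -= 1
    return list(reversed(ops))
```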


Affine gap distances
Levenshtein fails on some pairs that seem quite similar:
  William W. Cohen
  William W. 'Don't call me Dubya' Cohen

Affine gap distances - 2
Idea:
– Current cost of a "gap" of n characters: nG
– Make this cost A + (n-1)B, where A is the cost of "opening" a gap and B is the cost of "continuing" a gap.

Computing Levenshtein distance - variant
Distance version:
D(i,j) = score of best alignment from s1..si to t1..tj
       = min( D(i-1,j-1) + d(si,tj)   //subst/copy
              D(i-1,j)+1              //insert
              D(i,j-1)+1 )            //delete
  with d(x,x) = 0, d(x,y) = 1 if x != y

Similarity version (maximize instead of minimize):
D(i,j) = max( D(i-1,j-1) + d(si,tj)   //subst/copy
              D(i-1,j)-1              //insert
              D(i,j-1)-1 )            //delete
  with d(x,x) = 2, d(x,y) = -1 if x != y

Affine gap distances - 3
Replace the simple insert/delete terms of D(i,j) = max( D(i-1,j-1)+d(si,tj), D(i-1,j)-1, D(i,j-1)-1 ) with two extra tables:

D(i,j)  = max( D(i-1,j-1)  + d(si,tj)    //subst/copy
               IS(i-1,j-1) + d(si,tj)
               IT(i-1,j-1) + d(si,tj) )

IS(i,j) = max( D(i-1,j)  - A
               IS(i-1,j) - B )    // best score in which si is aligned with a 'gap'

IT(i,j) = max( D(i,j-1)  - A
               IT(i,j-1) - B )    // best score in which tj is aligned with a 'gap'

Affine gap distances - 4
[State diagram: three states D, IS, IT. Matching (weight d(si,tj)) returns to D; opening a gap costs -A; continuing a gap costs -B.]
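
A sketch of these D / IS / IT recurrences (Gotoh-style), using the similarity scoring of the max-variant slide (match = 2, mismatch = -1); the function and parameter names are mine:

```python
def affine_gap_score(s, t, A=2.0, B=0.5, match=2.0, mismatch=-1.0):
    """Fill the three tables D, IS, IT from the recurrences above and return
    the best global alignment score. A = gap-open cost, B = gap-continue cost;
    match/mismatch follow the max-variant scoring (2 / -1)."""
    NEG = float("-inf")
    n, m = len(s), len(t)
    D  = [[NEG] * (m + 1) for _ in range(n + 1)]
    IS = [[NEG] * (m + 1) for _ in range(n + 1)]
    IT = [[NEG] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):          # a leading run of gaps in t
        IS[i][0] = -A - (i - 1) * B
    for j in range(1, m + 1):          # a leading run of gaps in s
        IT[0][j] = -A - (j - 1) * B
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j]  = max(D[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1]) + d
            IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)   # s[i] vs a gap
            IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)   # t[j] vs a gap
    return max(D[n][m], IS[n][m], IT[n][m])
```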

Back to subsequence kernels

Subsequence kernel
Feature space: the set of all sparse subsequences u of x_1 × … × x_n, with each u downweighted according to its sparsity.
Relaxation of the old kernel:
1. We don't have to match everywhere, just at selected locations.
2. For every position in the pattern, we get a penalty of λ.
To pick a "feature" inside (x_1 … x_n)':
1. Pick a subset of locations i = i_1,…,i_k, and then
2. Pick a feature value in each location.
3. In the preprocessed vector x', weight every feature for i by λ^length(i) = λ^(i_k - i_1 + 1).

Subsequence kernel
[Slide shows the kernel written two ways: as an explicit sum over patterns u,
K(s,t) = Σ_u Σ_{i: u=s[i]} Σ_{j: u=t[j]} λ^length(i) · λ^length(j),
or recursively via the K'_i quantities on the later slides.]

Example
s: 1-Nop7  2-binds  3-readily  4-to  5-the  6-ribosomal  7-protein  8-YTM1
t: 1-Erb1  2-binds  3-to  4-YTM1


Subsequence kernels w/o features
Example strings:
– "Elvis Presley was born on Jan 8"  →  s1) PERSON was born on DATE.
– "William Cohen was born in New York City on April 6"  →  s2) PERSON was born in LOCATION on DATE.
Plausible pattern:
– PERSON was born … on DATE.
What we'll actually learn:
– u = PERSON … was … born … on … DATE.
– u matches s if there exist i = i_1,…,i_n so that s[i] = s[i_1]…s[i_n] = u, where i_1,…,i_n are increasing indices in s.
– For string s1, i = 1,2,3,4,5. For string s2, i = 1,2,3,6,7.
[Lodhi et al., JMLR 2002]

Subsequence kernels w/o features
s1) PERSON was born on DATE.
s2) PERSON was born in LOCATION on DATE.
Pattern:
– u = PERSON … was … born … on … DATE.
– u matches s if there exist increasing indices i = i_1,…,i_n so that s[i] = s[i_1]…s[i_n] = u.
– For string s1, i = 1,2,3,4,5. For string s2, i = 1,2,3,6,7.
How do we say that s1 matches better than s2?
– Weight a match of s to u by λ^length(i), where length(i) = i_n - i_1 + 1.
Now let's define K(s,t) = the sum over all u that match both s and t of matchWeight(u,s) * matchWeight(u,t).
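
To make the definition concrete, here is a brute-force (exponential, illustration-only) version of K(s,t): enumerate patterns u, weight each occurrence by λ^length(i), and sum the products of match weights. The names and λ = 0.5 are illustrative choices:

```python
from itertools import combinations

def match_weight(u, s, lam):
    """Sum of lam**length(i) over all index sequences i where u = s[i];
    length(i) = i_k - i_1 + 1, so sparser matches get smaller weight."""
    total = 0.0
    for idx in combinations(range(len(s)), len(u)):
        if all(s[i] == x for i, x in zip(idx, u)):
            total += lam ** (idx[-1] - idx[0] + 1)
    return total

def naive_K(s, t, n, lam=0.5):
    """K(s,t) = sum over length-n patterns u of matchWeight(u,s) * matchWeight(u,t).
    Patterns that never occur in s contribute 0, so enumerating u from s suffices."""
    patterns = {tuple(s[i] for i in idx) for idx in combinations(range(len(s)), n)}
    return sum(match_weight(u, s, lam) * match_weight(u, t, lam) for u in patterns)

s1 = "PERSON was born on DATE .".split()
s2 = "PERSON was born in LOCATION on DATE .".split()
# naive_K(s1, s2, n=5) gives the sparser match in s2 (span 7) less weight
# than the dense match in s1 (span 5).
```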

K'_i(s,t) = "we're paying the λ penalty now": the weighted count of patterns u of length i that match both s and t, where the pattern extends to the end of s.
[Recursive formulas for K_i and K'_i shown on slide.]
These recursions allow dynamic programming.
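
A memoized sketch of the corresponding dynamic program, following the K / K' recursions of Lodhi et al. (2002); it should agree with the brute-force version above:

```python
from functools import lru_cache

def subsequence_kernel(s, t, n, lam=0.5):
    """Gap-weighted subsequence kernel K_n(s, t), computed with the standard
    K / K' recursions of Lodhi et al. (2002) and memoization."""
    s, t = tuple(s), tuple(t)

    @lru_cache(maxsize=None)
    def K_prime(i, ls, lt):
        # Weighted count of length-i matches in s[:ls], t[:lt] that also pay
        # the lambda penalty out to the end of both prefixes.
        if i == 0:
            return 1.0
        if ls < i or lt < i:
            return 0.0
        x = s[ls - 1]
        total = lam * K_prime(i, ls - 1, lt)
        for j in range(1, lt + 1):
            if t[j - 1] == x:
                total += K_prime(i - 1, ls - 1, j - 1) * lam ** (lt - j + 2)
        return total

    @lru_cache(maxsize=None)
    def K(i, ls, lt):
        if ls < i or lt < i:
            return 0.0
        x = s[ls - 1]
        total = K(i, ls - 1, lt)
        for j in range(1, lt + 1):
            if t[j - 1] == x:
                total += K_prime(i - 1, ls - 1, j - 1) * lam ** 2
        return total

    return K(n, len(s), len(t))
```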

Subsequence kernel with features
Feature space: the set of all sparse subsequences u of x_1 × … × x_n, with each u downweighted according to its sparsity.
Relaxation of the old kernel:
1. We don't have to match everywhere, just at selected locations.
2. For every position we decide to match at, we get a penalty of λ.
To pick a "feature" inside (x_1 … x_n)':
1. Pick a subset of locations i = i_1,…,i_k, and then
2. Pick a feature value in each location.
3. In the preprocessed vector x', weight every feature for i by λ^length(i) = λ^(i_k - i_1 + 1).

Subsequence kernel w/ features
[Slide shows the same kernel formulas, with the exact-match test replaced by c(x,y),] where c(x,y) = the number of ways x and y match (i.e., the number of common features).

[In the recursions, the sum over positions j where t[j] exactly matches becomes a sum over all j, weighted by c(x, t[j]) = the number of ways x and t[j] match (i.e., the number of common features).]
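
A sketch of how the K' recursion changes when each position is a feature set rather than a single token: the exact-match test disappears and every position j contributes, weighted by c(x, t[j]). The set representation and the helper c are assumptions; the code is unmemoized for clarity:

```python
def c(x, y):
    """Number of ways positions x and y match, i.e. their common features."""
    return len(x & y)

def K_prime_feat(i, s, t, lam=0.5):
    """K'_i when s and t are sequences of feature sets: the exact-match test
    of the plain recursion is gone; every position j contributes, weighted
    by c(x, t[j])."""
    if i == 0:
        return 1.0
    if len(s) < i or len(t) < i:
        return 0.0
    x = s[-1]
    total = lam * K_prime_feat(i, s[:-1], t, lam)
    for j in range(1, len(t) + 1):
        total += (c(x, t[j - 1])
                  * K_prime_feat(i - 1, s[:-1], t[:j - 1], lam)
                  * lam ** (len(t) - j + 2))
    return total
```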

Additional details
Special domain-specific tricks for combining the subsequences for what matches in the fore, aft, and between sections of a relation-instance pair:
– Subsequences are of length less than 4. (Is DP even needed for this now?)
– Count fore-between, between-aft, and between subsequences separately.

Results Protein-protein interaction

And now a further extension…
Suppose we don't have annotated data, but we do know which proteins interact.
– This is actually pretty reasonable.
We can find examples of sentences with p1, p2 that don't interact, and be pretty sure they are negative.
We can find example strings for interacting p1, p2, e.g. "phosphorylates", but we can't be sure they are all positive examples of a relation.

And now a further extension…
Multiple instance learning:
– An instance is a bag {x_1,…,x_n}, y, where each x_i is a vector of features, and:
    if y is positive, some of the x_i's have a positive label;
    if y is negative, none of the x_i's have a positive label.
– Approaches: EM, SVM techniques.
– Their approach: treat all x_i's as positive examples, but downweight the cost of misclassifying them.

[SVM objective shown on slide, with an intercept term, slack variables, L_p = total size of the positive bags, L_n = total size of the negative bags, and a parameter c_p < 0.5.]
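
This is not the authors' exact optimization problem, but a rough sketch of the downweighting idea using scikit-learn's per-sample weights; the normalization by L_p and L_n, the value of c_p, and the linear kernel are all assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def fit_downweighted_svm(X_pos, X_neg, c_p=0.25, C=1.0):
    """X_pos: feature vectors from positive bags (sentences mentioning an
    interacting pair); X_neg: instances from negative bags. Positive-bag
    instances are all labeled +1 but get a smaller per-sample cost."""
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), -np.ones(len(X_neg))])
    L_p, L_n = len(X_pos), len(X_neg)          # total sizes of pos/neg bags
    weights = np.concatenate([np.full(L_p, c_p / L_p),          # c_p < 0.5
                              np.full(L_n, (1 - c_p) / L_n)])
    clf = SVC(C=C, kernel="linear")
    clf.fit(X, y, sample_weight=weights)
    return clf
```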

Datasets
Collected with Google search queries, then sentence-segmented. This is terrible data, since there are lots of spurious correlations with Google, Adobe, …

Datasets
Fix: downweight words in patterns u if they are strongly correlated with particular bags (e.g. the Google/YouTube bag).

Results