Relation Extraction CSCI-GA.2591

1 Relation Extraction CSCI-GA.2591
Ralph Grishman, NYU

2 ACE Relations
An ACE relation mention connects two entity mentions in the same sentence:
  the CEO of Microsoft  →  OrgAff:Employment(the CEO of Microsoft, Microsoft)
  in the West Bank, a passenger was wounded  →  Phys:Located(a passenger, the West Bank)
ACE 2005 had 6 types of relations and 18 subtypes; most papers report on types only.
Most relations are local: in roughly 70% of relations, the arguments are adjacent or separated by one word, so chunking is important but full parsing is not critical.
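A relation mention like the examples above can be represented as a pair of typed entity mentions plus a type/subtype label. A minimal sketch (class and field names are illustrative, not from any ACE toolkit):

```python
from dataclasses import dataclass

@dataclass
class EntityMention:
    text: str
    etype: str          # e.g. "PER", "ORG", "GPE"

@dataclass
class RelationMention:
    rtype: str          # one of the 6 ACE 2005 types, e.g. "ORG-AFF"
    subtype: str        # e.g. "Employment"
    arg1: EntityMention
    arg2: EntityMention

ceo = EntityMention("the CEO of Microsoft", "PER")
msft = EntityMention("Microsoft", "ORG")
rm = RelationMention("ORG-AFF", "Employment", ceo, msft)
print(rm.rtype)  # ORG-AFF
```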

3 Benchmarks
ACE 2003 / 2004 / 2005 corpora
  generally assume perfect entity mentions on input
  some work assumes only position (and not semantic type) is given
Semeval-2010 task 8
  carefully selected examples of 10 relations
  a classification task

4 Using MaxEnt
First description of an ACE relation extractor: the IBM system [Kambhatla ACL 2004]
Features used: words, entity type, mention level, overlap, dependency tree, parse tree
Used 2003 ACE data
F = 55 (perfect mentions), F = 23 (system mentions)
  → good system mentions are important

5 Lots of features
Singapore system [Zhou et al. ACL 2005] used a very rich feature set, including:
  11 chunk-based features
  a family-relative feature
  2 country-name features
  7 dependency-based features
  . . .
Highly tuned to the ACE task; used perfect mentions.
F = 68 (relation type), F = 55 (subtype): reports several % gain over the IBM system.
Further extended at NYU, on ACE 2004: F = 70.1

6 Kernel methods and SVMs
As an alternative to a feature-based model, one can provide a kernel function: a similarity function between pairs of the objects being classified.
The kernel can be used directly by a kNN (nearest-neighbor) classifier, or can be used in training an SVM (Support Vector Machine).
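The direct kNN use of a kernel mentioned above can be sketched as follows. The shared-token kernel here is a toy similarity invented for illustration, not one from the literature:

```python
def kernel(x, y):
    """Toy kernel: similarity = number of tokens shared by two sequences."""
    return len(set(x) & set(y))

def knn_classify(query, training_data):
    """Label the query with the class of its most similar training example (1-NN)."""
    best = max(training_data, key=lambda ex: kernel(query, ex[0]))
    return best[1]

train = [
    (["the", "CEO", "of", "Microsoft"], "ORG-AFF"),
    (["wounded", "in", "the", "West", "Bank"], "PHYS"),
]
print(knn_classify(["the", "president", "of", "IBM"], train))  # ORG-AFF
```

No feature vectors are ever built: the classifier only needs pairwise similarities, which is what makes kernels attractive for structured inputs.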

7 SVM
The SVM, when trained, creates a separating hyperplane.
If the data are fully separable, all data on one side of the hyperplane are classified +, all on the other side −.
It is an inherently binary classifier.

8 Benefit of kernel methods
Kernels provide a natural way of handling structured input of variable size: sequences and trees.
A feature-based system may require a large number of features for the same effect.

9 Shortest-path kernel
[Bunescu & Mooney EMNLP 2005]; ACE Sept 2002 corpus
Based on the dependency path between the arguments.
Kernel function between two paths x and y of lengths m and n:
  K(x, y) = 0 if m ≠ n; otherwise the product over positions i of c(x_i, y_i)
  where c = degree of match (lexical / POS)
Train an SVM; F = 52.5
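The kernel above can be sketched directly: each path position is a set of features (word, POS, ...), paths of different length get similarity 0, and otherwise the kernel is the product of per-position match counts. The toy feature sets below are illustrative:

```python
def sp_kernel(x, y):
    """Shortest-path-style kernel: product of per-position feature overlaps.

    x, y: lists of feature sets, one set per position on the dependency path.
    Returns 0 when the paths have different lengths.
    """
    if len(x) != len(y):
        return 0
    k = 1
    for fx, fy in zip(x, y):
        k *= len(fx & fy)   # c = degree of match at this position
    return k

# Two toy paths; each position carries {word, POS-or-dependency} features.
p1 = [{"his", "PRP"}, {"->", "poss"}, {"actions", "NNS"}]
p2 = [{"her", "PRP"}, {"->", "poss"}, {"protests", "NNS"}]
print(sp_kernel(p1, p2))  # 1 * 2 * 1 = 2
```

A single position with no overlap zeroes out the whole product, which is why this kernel is strict but fast.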

10 Tree kernel
To take account of more of the tree than the dependency path, use the PET (path-enclosed tree):
  the portion of the tree enclosed by the shortest path.
Using the entire sentence tree introduces too much irrelevant data.
Use a tree kernel which recursively compares the two trees, for example by counting the number of shared subtrees.
The best kernel is a composite kernel: tree kernel + entity kernel.
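The shared-subtree idea can be sketched with trees as nested tuples. This counts pairs of identical subtrees, a simplification of the convolution tree kernels used in practice:

```python
def subtrees(t):
    """All subtrees of a tree given as nested tuples (label, children...)."""
    out = [t]
    for child in t[1:]:
        if isinstance(child, tuple):
            out.extend(subtrees(child))
    return out

def tree_kernel(t1, t2):
    """Toy tree kernel: count pairs of identical subtrees of t1 and t2."""
    s2 = subtrees(t2)
    return sum(1 for a in subtrees(t1) for b in s2 if a == b)

# Two toy PETs that share only the ("DT", "the") subtree.
t1 = ("NP", ("DT", "the"), ("NN", "CEO"))
t2 = ("NP", ("DT", "the"), ("NN", "chair"))
print(tree_kernel(t1, t2))  # 1
```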

11 Lexical Generalization
Test data will include words not seen in training. Remedies:
  use lemmas
  use Brown clusters
  use word embeddings
These can be used with feature-based or kernel-based methods.
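The Brown-cluster remedy amounts to backing off from an unseen word to its cluster bit string, so a test word still fires a feature shared with training words. The tiny cluster table below is illustrative; real tables are induced from large corpora:

```python
# Hypothetical cluster table: words in the same Brown cluster share a bit string.
clusters = {
    "wounded": "1100",
    "injured": "1100",
    "CEO": "0111",
    "president": "0111",
}

seen_in_training = {"wounded", "CEO"}

def lexical_feature(word):
    """Use the word itself when seen in training, else back off to its cluster."""
    if word in seen_in_training:
        return "w=" + word
    return "c=" + clusters.get(word, "UNK")

print(lexical_feature("wounded"))   # w=wounded
print(lexical_feature("injured"))   # c=1100  (same cluster as "wounded")
```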

12 FCM (Feature-rich Compositional Embedding Model)
Combines word embeddings and hand-made discrete features:
  for each word, form the outer product f ⊗ e and score it against a weight matrix T
where
  e is the word embedding vector
  f is a vector of hand-coded features
  T is a matrix of weights
If e is fixed during training, this is a feature-rich log-linear model.
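The composition can be sketched in pure Python with toy dimensions (the shapes and values below are illustrative, not from the paper):

```python
def outer(f, e):
    """Outer product of a feature vector f and an embedding e."""
    return [[fi * ej for ej in e] for fi in f]

def fcm_score(T, words):
    """Score a candidate: sum over words of the inner product <T, f ⊗ e>.

    words: list of (f, e) pairs; T: weight matrix of shape len(f) x len(e).
    """
    s = 0.0
    for f, e in words:
        m = outer(f, e)
        for i in range(len(f)):
            for j in range(len(e)):
                s += T[i][j] * m[i][j]
    return s

# One word with binary features [1, 0] and embedding [0.5, -1.0, 2.0].
T = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]
print(round(fcm_score(T, [([1, 0], [0.5, -1.0, 2.0])]), 6))  # 0.45
```

Because f gates which rows of T a word's embedding touches, the same embedding contributes differently depending on the word's discrete context features.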

13 Neural Networks
Neural networks:
  provide a richer model than log-linear models
  reduce the need for feature engineering (although it may help to add features to the embeddings)
  but are slow to train and hard to inspect
Several types of networks have been used:
  convolutional NNs
  recurrent NNs
An ensemble of different NN types appears most effective; it may even include a log-linear model.

14 Some comparisons
ACE 2005: train on nw+bn, test on bc, perfect mentions, including entity types.
Systems compared: log-linear system, FCM, hybrid FCM, CNN, NN ensemble.
[Table of scores not recoverable from the transcript.]
The richer model of even a simple NN beats a log-linear (MaxEnt) system.
[Nguyen and Grishman, IJCAI Workshop 2016]

15 Comparing scores
Using a subset of ACE 2005 (news); feature-based system; perfect mention position but no type info.
Systems compared: baseline, single Brown cluster, multiple clusters, word embedding (WE), multiple clusters + WE, multiple clusters + WE + regularization (F = 59.4).
[Remaining scores not recoverable from the transcript.]
Moral: lexical generalization and regularization are worthwhile (probably for all ACE tasks).
[Nguyen & Grishman ACL 2014]

16 Distant Supervision
We have focused on supervised methods, which produce the best performance.
If we have a large database with instances of the relations of interest, we can use distant supervision:
  use the database to tag the corpus: if the DB has relation R(x, y), tag all sentences in the corpus containing x and y as examples of R
  train a model from the tagged corpus

17 Distant Supervision
By itself, distant supervision is too noisy:
  if the same pair <x, y> is connected by several relations, which one do we label?
But it can be combined with selective manual annotation to produce a satisfactory result.

