
1 Automatic Discovery of Scenario-Level Patterns for Information Extraction
Roman Yangarber, Ralph Grishman, Pasi Tapanainen, Silja Huttunen
(NYU, ANLP-00)

2 Outline
- Information Extraction: background
- Problems in IE
- Prior work: machine learning for IE
- Discover patterns from raw text
- Experimental results
- Current work

3 Quick Overview
- What is Information Extraction?
- Definition:
  - finding facts about a specified class of events in free text
  - filling a table in a database (slots in a template)
- Events: instances of relations, with many arguments

5 Example: Management Succession
"George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA."
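As a sketch of the target output, this is the kind of filled template the sentence would yield; the slot names are illustrative assumptions, loosely following MUC-6 style, not the actual template definition from the talk:

    # Hypothetical filled template for the Garrick sentence.
    # Slot names are illustrative, loosely MUC-6 style.
    succession_event = {
        "person_in":    "George Garrick",
        "position":     "chief executive officer",
        "organization": "Nielsen Marketing Research, USA",
        "person_out":   None,  # the outgoing officer is not mentioned
    }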

6 System Architecture: Proteus
Input Text
  → Lexical Analysis → Name Recognition → Partial Syntax → Scenario Patterns   [sentence level]
  → Reference Resolution → Discourse Analyzer   [discourse level]
  → Output Generation → Extracted Information
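A minimal sketch of how such a staged pipeline could be wired together; every stage below is a trivial stub standing in for the corresponding Proteus component, not the component itself:

    # Skeleton of a staged IE pipeline in the spirit of the diagram above.
    # Each stage is a stub; the real components do the actual work.

    def lexical_analysis(text):       return text.split()   # tokenize, lexicon lookup
    def recognize_names(tokens):      return tokens         # tag person/org/location names
    def partial_syntax(tokens):       return [tokens]       # shallow-parse into clauses
    def match_patterns(clauses, ps):  return []             # fire sentence-level scenario patterns
    def resolve_references(events):   return events         # link pronouns and aliases
    def discourse_analysis(events):   return events         # merge partial events across sentences
    def generate_output(events):      return events         # fill the output templates

    def extract(text, scenario_patterns):
        tokens  = lexical_analysis(text)
        tokens  = recognize_names(tokens)
        clauses = partial_syntax(tokens)
        events  = match_patterns(clauses, scenario_patterns)
        events  = resolve_references(events)
        events  = discourse_analysis(events)
        return generate_output(events)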

8 Problems
- Customization
- Performance

9 Problems: Customization
- To customize a system for a new extraction task, we have to develop:
  - new patterns for new types of events
  - word classes for the domain
  - inference rules
- This can be a large job requiring skilled labor:
  - the expense of customization limits the use of extraction

10 Problems: Performance
- Performance on event IE is limited
- On MUC tasks, typical top performance is recall < 55%, precision < 75%
- Errors propagate through multiple phases:
  - name recognition errors
  - syntax analysis errors
  - missing patterns
  - reference resolution errors
  - complex inference required

11 Missing Patterns
- As with many language phenomena, there are:
  - a few common patterns
  - a large number of rare patterns
- Rare patterns do not surface often enough in a limited corpus
- Missing patterns make customization expensive and limit performance
- Finding good patterns is necessary to improve customization and performance
[Figure: pattern frequency vs. rank]

12 Prior Research
- Build patterns from examples:
  - Yangarber '97
- Generalize from multiple examples in annotated text:
  - Crystal, Whisk (Soderland), Rapier (Califf)
- Active learning: reduce annotation:
  - Soderland '99, Califf '99
- Learning from a corpus with relevance judgements:
  - Riloff '96, '99
- Co-learning / bootstrapping:
  - Brin '98, Agichtein '00

13 Our Goals
- Minimize the manual labor required to construct pattern bases for a new domain:
  - un-annotated text
  - un-classified text
  - un-supervised learning
- Use very large corpora, larger than we could ever tag manually, to improve the coverage of the patterns

14 Principle: Pattern Density
- If we have relevance judgements for the documents in a corpus, for the given task, then the patterns which are much more frequent in relevant documents will generally be good patterns
- Riloff (1996) finds patterns related to terrorist attacks

15 Principle: Duality
- Duality between patterns and documents:
  - relevant documents are strong indicators of good patterns
  - good patterns are strong indicators of relevant documents

16 Outline of Procedure
- Initial query: a small set of seed patterns which partially characterize the topic of interest
- Repeat (sketched in code below):
  - Retrieve documents containing the seed patterns: "relevant documents"
  - Rank patterns in relevant documents by frequency in relevant docs vs. overall frequency
  - Add the top-ranked pattern to the seed pattern set
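A compact sketch of this loop, assuming each document is represented as the set of candidate-pattern ids it contains; the score is the Riloff-style metric from slide 37, and the graded-relevance refinement of slide 36 is simplified to binary relevance:

    import math

    def discover(corpus, seeds, iterations=80):
        # corpus: list of documents, each a set of candidate-pattern ids
        # seeds:  initial pattern ids partially characterizing the topic
        accepted = set(seeds)
        all_patterns = set().union(*corpus)
        for _ in range(iterations):
            # Documents matching any accepted pattern count as relevant.
            relevant = [doc for doc in corpus if doc & accepted]
            best, best_score = None, 0.0
            for p in all_patterns - accepted:
                h     = sum(1 for doc in corpus   if p in doc)  # overall doc frequency
                h_rel = sum(1 for doc in relevant if p in doc)  # doc frequency in relevant docs
                if h_rel == 0:
                    continue
                score = (h_rel / h) * math.log(h_rel)  # relevance rate x log frequency
                if score > best_score:
                    best, best_score = p, score
            if best is None:
                break           # nothing left with a positive score
            accepted.add(best)  # promote the top-ranked pattern
        return accepted - set(seeds)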

17 #1: pick seed pattern
Seed: [seed-pattern graphic not preserved in the transcript]

18 #2: retrieve relevant documents
Seed: [graphic not preserved]
Relevant documents:
- "Fred retired. ... Harry was named president."
- "Maki retired. ... Yuki was named president."
Other documents: no match

19 #3: pick new pattern
New pattern: [graphic not preserved]; appears in several relevant documents (top-ranked by Riloff metric)
- "Fred retired. ... Harry was named president."
- "Maki retired. ... Yuki was named president."

20 #4: add new pattern to pattern set
Pattern set: [graphic not preserved]

21 Pre-processing
- For each document, find and classify names:
  - { person | location | organization | … }
- Parse the document:
  - (regularize passives, relative clauses, etc.)
- For each clause, collect a candidate pattern: a tuple of the heads of (sketched as a data structure below)
  - [ subject, verb, direct object, object/subject complement, locative and temporal modifiers, … ]
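A sketch of such a candidate-pattern tuple as a data structure; the field names and class labels are illustrative assumptions, not the system's actual representation:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class CandidatePattern:
        # Heads of one clause's constituents; names replaced by their class.
        subject: Optional[str]               # e.g. "C-Person"
        verb: str                            # e.g. "name" (passives regularized to active)
        direct_object: Optional[str] = None  # e.g. "C-Person"
        complement: Optional[str] = None     # e.g. "president"

    # "Harry was named president" regularizes (roughly) to:
    p = CandidatePattern(subject=None, verb="name",
                         direct_object="C-Person", complement="president")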

22 Experiment
- Task: management succession (as in MUC-6)
- Source: Wall Street Journal
- Training corpus: ~6,000 articles
- Test corpus:
  - 100 documents: MUC-6 formal training
  - + 150 documents judged manually

23 Experiment: two seed patterns
- v-appoint = { appoint, elect, promote, name }
- v-resign = { resign, depart, quit, step-down }
- Run the discovery procedure for 80 iterations (seeds written out below)
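As code, the seeds might look as follows; the argument-class constraints attached to the verb classes are assumptions, since the slide lists only the verbs:

    # Verb classes as listed on the slide.
    V_APPOINT = {"appoint", "elect", "promote", "name"}
    V_RESIGN  = {"resign", "depart", "quit", "step-down"}

    # Hypothetical seed patterns built from them; the subject/object
    # classes are assumed, not given on the slide.
    SEED_PATTERNS = (
        [("C-Company", v, "C-Person") for v in V_APPOINT]   # "<org> appoints <person>"
      + [("C-Person",  v, None)       for v in V_RESIGN]    # "<person> resigns"
    )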

24 Evaluation
- Look at the discovered patterns:
  - new patterns, missed in manual training
- Document filtering
- Slot filling

25 Discovered patterns
[table of discovered patterns not preserved in the transcript]

26 Evaluation: new patterns
- Not found in manual training:
[pattern list not preserved in the transcript]

27 Evaluation: Text Filtering
- How effective are the discovered patterns at selecting relevant documents?
  - IR-style
  - documents matching at least one pattern

28-29 [text-filtering result charts not preserved in the transcript]

30 Evaluation: Slot filling
- How effective are the patterns within a complete IE system?
- MUC-style IE on the MUC-6 corpora
- Caveat

33 Conclusion: Automatic discovery
- Performance comparable to human customization (4 weeks of development)
- Works from un-annotated text, which lets us take advantage of very large corpora:
  - redundancy
  - duality
- Will likely encourage wider use of IE


35 Good Patterns
- U = universe of all documents
- R = set of relevant documents
- H = H(p) = set of documents where pattern p matched
- Density criterion: |H ∩ R| / |H| >> |R| / |U|
  (i.e., relevant documents are much denser among p's matches than in the corpus overall)
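A direct transcription of the criterion as a predicate; the margin quantifying "much denser" is an assumed knob, not a value from the talk:

    def is_dense(h_docs, relevant_docs, corpus_size, margin=2.0):
        # h_docs: set of documents where pattern p matched, H(p)
        # relevant_docs: the relevant set R; corpus_size: |U|
        # margin: how much denser counts as "much" (assumption)
        h = len(h_docs)
        if h == 0:
            return False
        density_in_matches = len(h_docs & relevant_docs) / h   # |H ∩ R| / |H|
        density_overall    = len(relevant_docs) / corpus_size  # |R| / |U|
        return density_in_matches >= margin * density_overall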

36 Graded Relevance
- Documents matching the seed patterns are considered 100% relevant
- Discovered patterns are considered less certain
- Documents containing them are considered only partially relevant

37 NYU 37  document frequency in relevant documents overall document frequency  document frequency in relevant documents –(metrics similar to those used in Riloff-96) Scoring Patterns

