Event Extraction: Learning from Corpora
Prepared by Ralph Grishman
Based on research and slides by Roman Yangarber
NYU
Finding Patterns
How can we collect patterns?
Supervised learning
–mark information to be extracted in text
–collect information and context = specific patterns
–generalize patterns
Annotation quite expensive
Zipfian distribution of patterns means that annotation of consecutive text is inefficient … the same pattern is annotated many times
Unsupervised learning?
The intuition: if we collect documents D_R relevant to the scenario, patterns relevant to the scenario will occur more frequently in D_R than in the language as a whole
(cf. sublanguage predicates in Harris’s distributional analysis)
Riloff ‘96
Corpus manually divided into relevant and irrelevant documents
Collect patterns around each noun phrase
Score patterns by R log F, where R = relevance rate = freq in relevant docs / overall freq
Select top-ranked patterns
These patterns each find one template slot; combining filled slots into templates is a separate task
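The R log F ranking can be sketched as follows. This is only an illustration: the slide does not define F or the base of the logarithm, so the code assumes F is the pattern's frequency in relevant documents and uses log base 2. The names riloff_score and top_patterns are hypothetical.

```python
import math


def riloff_score(rel_freq: int, total_freq: int) -> float:
    """R log F, with R = rel_freq / total_freq.

    Assumptions (not stated on the slide): F is the frequency in
    relevant documents, and the logarithm is base 2.
    """
    if rel_freq <= 0 or total_freq <= 0:
        return 0.0
    relevance_rate = rel_freq / total_freq
    return relevance_rate * math.log2(rel_freq)


def top_patterns(counts: dict, k: int) -> list:
    """Rank patterns by score; counts maps pattern -> (rel_freq, total_freq)."""
    ranked = sorted(counts, key=lambda p: riloff_score(*counts[p]), reverse=True)
    return ranked[:k]
```

Under these assumptions a pattern seen only once in relevant documents scores 0, so the ranking favors patterns that are both frequent and concentrated in the relevant set.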
Extending the Discovery Procedure
Finding relevant documents automatically
–Yangarber … use patterns to select documents
–Sudo … use keywords and IR engine
Defining larger patterns (covering several template slots)
–Yangarber … clause structures
–Nobata; Sudo … larger structures
Automated Extraction Pattern Discovery
Goal: find examples / patterns relevant to a given scenario without any corpus tagging (Yangarber ‘00)
Method:
–identify a few seed patterns for scenario
–retrieve documents containing patterns
–find subject-verb-object pattern with
  high frequency in retrieved documents
  relatively high frequency in retrieved docs vs. other docs
–add pattern to seed set and repeat
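The loop described in the method can be sketched as below. This is a simplified illustration, not the procedure as actually implemented: each document is assumed to be already reduced to a set of pattern strings, one pattern is added per iteration, candidates are scored with R log F (log base 2 and F = number of retrieved documents containing the pattern are assumptions), and every accepted pattern is treated with full confidence, unlike the confidence < 1 noted for new patterns in the walkthrough. The name discover_patterns is hypothetical.

```python
import math


def discover_patterns(docs, seeds, iterations):
    """Bootstrap a pattern set from seed patterns.

    docs: list of sets, each holding the patterns found in one document.
    seeds: initial patterns; returns seeds plus discovered patterns, in order.
    """
    accepted = list(seeds)
    for _ in range(iterations):
        accepted_set = set(accepted)
        # Retrieve documents containing at least one accepted pattern.
        relevant = [d for d in docs if d & accepted_set]
        candidates = set().union(*docs) - accepted_set
        best, best_score = None, 0.0
        for cand in sorted(candidates):  # sorted for deterministic ties
            rel = sum(1 for d in relevant if cand in d)
            total = sum(1 for d in docs if cand in d)
            if rel == 0:
                continue
            score = (rel / total) * math.log2(rel)
            if score > best_score:
                best, best_score = cand, score
        if best is None:  # nothing scores above zero: stop early
            break
        accepted.append(best)
    return accepted
```

Because documents are re-retrieved on every pass, each newly accepted pattern can pull in further documents, which in turn can surface further patterns.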
#1: pick seed pattern
Seed:
#2: retrieve relevant documents
Seed:
Fred retired.... Harry was named president.
Maki retired.... Yuki was named president.
Relevant documents
Other documents
#3: pick new pattern
Seed:
appears in several relevant documents (top-ranked by Riloff metric)
Fred retired.... Harry was named president.
Maki retired.... Yuki was named president.
#4: add new pattern to pattern set
Pattern set:
Note: new patterns added with confidence < 1
Experiment
Task: Management succession (as MUC-6)
Source: Wall Street Journal
Training corpus: ~ 6,000 articles
Test corpus:
–100 documents: MUC-6 formal training
–+ 150 documents judged manually
Pre-processing
For each document, find and classify names:
–{ person | location | organization | … }
Parse document
–(regularize passive, relative clauses, etc.)
For each clause, collect a candidate pattern:
tuple: heads of
–[ subject, verb, direct object, object/subject complement, locative and temporal modifiers, … ]
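One way to hold such a candidate tuple is sketched below. The class name ClauseTuple, its field names, and the choice to store a name's class (for example "person") in place of the name itself are all assumptions made for illustration; the slide only lists which heads go into the tuple.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClauseTuple:
    """Heads of one regularized clause (hypothetical representation)."""
    subject: Optional[str]
    verb: str
    direct_object: Optional[str] = None
    complement: Optional[str] = None      # object/subject complement
    modifiers: Tuple[str, ...] = ()       # locative and temporal modifiers

    def svo(self) -> Tuple[Optional[str], str, Optional[str]]:
        """Subject-verb-object projection of the clause."""
        return (self.subject, self.verb, self.direct_object)


# "Harry was named president", with the passive regularized to active
# form (unknown subject) and the name replaced by its class.
clause = ClauseTuple(subject=None, verb="name",
                     direct_object="person", complement="president")
```

The frozen dataclass is hashable, so identical tuples from different clauses can be counted with a dictionary or collected in a set.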
Experiment: two seed patterns
v-appoint = { appoint, elect, promote, name }
v-resign = { resign, depart, quit, step-down }
Run discovery procedure for 80 iterations
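The two verb classes can be written down as sets, with a lookup that maps a clause's verb head to its class. The class members come from the slide; the dictionary layout and the function name verb_class are hypothetical.

```python
V_APPOINT = frozenset({"appoint", "elect", "promote", "name"})
V_RESIGN = frozenset({"resign", "depart", "quit", "step-down"})

VERB_CLASSES = {"v-appoint": V_APPOINT, "v-resign": V_RESIGN}


def verb_class(verb: str):
    """Return the class name for a verb head, or None if it is in no class."""
    for name, members in VERB_CLASSES.items():
        if verb in members:
            return name
    return None
```

Matching on the class rather than the individual verb lets one seed pattern cover every verb in the class.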
Evaluation
Look at discovered patterns
–new patterns, missed in manual training
Document filtering
Slot filling
Discovered patterns
Evaluation: Text Filtering
How effective are discovered patterns at selecting relevant documents?
IR-style evaluation: retrieve documents matching at least one pattern
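A minimal sketch of this filtering evaluation follows. It assumes each document is already reduced to a set of pattern strings and that "IR-style" means precision and recall against manual relevance judgments; the slide does not name the measures, and the function names are hypothetical.

```python
def filter_documents(docs: dict, patterns) -> set:
    """Return ids of documents matching at least one pattern.

    docs maps document id -> set of patterns found in that document.
    """
    pattern_set = set(patterns)
    return {doc_id for doc_id, found in docs.items() if found & pattern_set}


def precision_recall(retrieved: set, relevant: set):
    """Precision and recall of the retrieved set against the relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Running the filter with the pattern set from successive iterations of discovery would show how precision and recall move as patterns are added.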
Evaluation: Slot filling
How effective are patterns within a complete IE system?
MUC-style IE on MUC-6 corpora
Caveat: filtered / aligned by hand
manual–MUC
manual–now