Empirical Methods in Information Extraction. Claire Cardie. Appeared in AI Magazine, 18:4, 65-79, 1997. Summarized by Seong-Bae Park.
Information Extraction: a particular natural language understanding task that is inherently domain-specific. Input: unrestricted text. Output: information in a structured form. The system skims a text to find relevant sections and focuses only on those sections.
Architecture (1). Tokenizing and Tagging. Sentence Analysis: phrase identification and simple grammatical relations; find and label semantic entities relevant to the extraction topic. Difference from traditional parsers: IE does not need a complete, detailed parse tree. Extraction: identify domain-specific relations among the relevant entities.
Architecture (2). Merging: the main job is coreference resolution (anaphora resolution); optionally, resolving implicit subjects. Template Generation: determine the number of distinct events, map the individually extracted pieces onto each event, and produce the output templates.
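The stages on the two architecture slides can be chained into a toy pipeline. The following is only a minimal sketch, not the architecture of any real system: the lexicon, the "acquired" relation, and every function name are hypothetical, and merging is reduced to de-duplicating identical events rather than doing real coreference resolution.

```python
import re

# Hypothetical domain lexicon: surface form -> semantic class.
LEXICON = {"Acme": "company", "Bolt": "company", "Smith": "person"}

def tokenize(text):
    """Tokenizing: split text into sentences, each a list of word tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"[A-Za-z]+", s) for s in sentences if s]

def analyze(sentences):
    """Sentence analysis: label tokens found in the lexicon (no full parse tree)."""
    return [[(tok, LEXICON.get(tok)) for tok in sent] for sent in sentences]

def extract(analyzed):
    """Extraction: find the domain relation 'company acquired company'."""
    events = []
    for sent in analyzed:
        for i in range(1, len(sent) - 1):
            (left, left_cls), (verb, _), (right, right_cls) = sent[i - 1], sent[i], sent[i + 1]
            if verb == "acquired" and left_cls == right_cls == "company":
                events.append({"buyer": left, "target": right})
    return events

def merge(events):
    """Merging: toy stand-in for coreference; identical events count as one."""
    unique = []
    for event in events:
        if event not in unique:
            unique.append(event)
    return unique

def generate_templates(events):
    """Template generation: one output template per distinct event."""
    return [{"type": "ACQUISITION", **event} for event in events]

def run_pipeline(text):
    return generate_templates(merge(extract(analyze(tokenize(text)))))

print(run_pipeline("Acme acquired Bolt. Smith said Acme acquired Bolt."))
```

Both sentences mention the same acquisition, so the merging step leaves a single event and one template is produced.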
Role of Corpus-Based Language Learning Algorithms. The catch: obtaining enough training data. For general language tasks, annotated corpora such as the Penn Treebank are available. Some problems remain: for learning extraction patterns, coreference resolution, and template generation, ML techniques are difficult to apply, because no annotated corpora exist and semantic, domain-specific language processing skill is needed.
Learning Extraction Patterns. A good pattern is general enough to extract the correct information from more than one sentence, yet specific enough not to apply in inappropriate contexts. A number of learning methods exist; they differ in: the class of patterns learned; the training corpus required; the amount and type of human feedback required; the degree of preprocessing necessary; the background knowledge required; the biases inherent in the learning algorithm itself.
AutoSlog (1). Learns extraction patterns in the form of domain-specific “concept node” definitions for use with the CIRCUS parser. Concept nodes: domain-specific semantic case frames that contain a maximum of one slot per frame.
AutoSlog (2). One-shot learning algorithm. Training corpus: a set of texts with noun phrases annotated with the appropriate concept type; associated answer keys, as in the MUC corpus. Required: a partial parser and a small set (approximately 13) of general linguistic patterns.
AutoSlog (3). To derive a pattern for extracting an annotated noun phrase (NP): 1. Find the sentence from which the NP originated. 2. Present the sentence to the partial parser for processing. 3. Apply the linguistic patterns in order; each identifies a thematic role based on syntactic position. 4. When a pattern applies, generate a concept node definition from the matched constituents, their context, the concept type provided in the annotation for the target NP, and the predefined semantic class for the filler.
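The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not AutoSlog itself: the `Clause` structure stands in for the partial parser's output, only three simplified patterns are shown instead of the roughly 13 the system uses, and all class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Clause:
    """Hypothetical partial-parser output for one clause."""
    subject: str
    verb: str
    voice: str                # "active" or "passive"
    direct_object: str = ""

@dataclass
class ConceptNode:
    """Single-slot case frame generated from one matched pattern."""
    trigger: str              # word that activates the node
    voice: str                # voice the trigger must appear in
    pattern: str              # linguistic pattern that matched
    slot_position: str        # syntactic position the filler is taken from
    concept_type: str         # taken from the annotation of the target NP
    filler_class: str         # predefined semantic class expected of the filler

# Ordered, simplified patterns: (name, required voice, position of the target NP).
PATTERNS = [
    ("<subject> passive-verb", "passive", "subject"),
    ("<subject> active-verb", "active", "subject"),
    ("active-verb <direct-object>", "active", "direct_object"),
]

def derive_concept_node(clause, target_np, concept_type, filler_class) -> Optional[ConceptNode]:
    """Steps 3 and 4: try the patterns in order; the first that applies yields a node."""
    for name, voice, position in PATTERNS:
        if clause.voice == voice and getattr(clause, position) == target_np:
            return ConceptNode(clause.verb, voice, name, position, concept_type, filler_class)
    return None

def apply_concept_node(node, clause) -> Optional[str]:
    """Use a learned node on a new clause (the semantic class check is omitted here)."""
    if clause.verb == node.trigger and clause.voice == node.voice:
        return getattr(clause, node.slot_position) or None
    return None

# Training sentence: "The mayor was kidnapped." with "the mayor" annotated as a victim.
node = derive_concept_node(Clause("the mayor", "kidnapped", "passive"),
                           "the mayor", "victim", "human")
print(node)
print(apply_concept_node(node, Clause("three peasants", "kidnapped", "passive")))
```

Because learning is one-shot, a single annotated NP produces a node that then fires on any later clause with the same trigger in the same voice, which is why the balance between generality and specificity of the patterns matters.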
Other Systems. PALKA (Kim and Moldovan, 1995): requires background knowledge in the form of a concept hierarchy, a set of predefined keywords that can be used to trigger each pattern, and a semantic class lexicon. CRYSTAL (Soderland et al., 1995): learns extraction patterns in the form of semantic case frames. Huffman’s LIEP system.
Coreference Resolution (1). An example from MUC-6. Major weaknesses of existing IE systems: they use manually generated heuristics (do these generalize?); they assume the input is fully parsed, with grammatical functions and thematic roles available; errors accumulate from sentence to sentence. A coreference component must be able to handle the myriad forms of coreference.
Coreference Resolution (2). Empirical methods: inductive learning algorithms can be applied. MLR (Aone and Bennett, 1995): on Japanese. RESOLVE (McCarthy and Lehnert, 1995): on English. Both use C4.5 as the learning algorithm. Datasets: MLR, automatically generated; RESOLVE, manually generated and noise-free.
Coreference Resolution (3). Feature sets. MLR: 66 features covering (1) lexical features of each phrase, (2) the grammatical role of the phrase, (3) semantic class information, and (4) relative positional information. RESOLVE: 8 features: (5) whether each phrase contains a proper name (2 features), (6) whether one or both phrases refer to the entity formed by a joint venture (3 features), (7) whether one phrase contains an alias of the other (1 feature), (8) whether the phrases have the same base noun phrase (1 feature), (9) whether the phrases originate from the same sentence (1 feature). (1) to (4) are domain independent.
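To make the pairwise features (5) to (9) concrete, here is a hedged sketch of how a pair of phrases might be turned into an eight-element instance for a decision-tree learner. The `Phrase` fields, the acronym-based alias test, and the way the three joint-venture features are split (first phrase, second phrase, both) are assumptions made for illustration; the slide does not specify them.

```python
from dataclasses import dataclass

@dataclass
class Phrase:
    """Hypothetical phrase record produced by earlier IE stages."""
    text: str
    base_np: str             # base noun phrase, assumed supplied by sentence analysis
    sentence_index: int
    has_proper_name: bool
    is_joint_venture: bool   # phrase refers to the entity formed by a joint venture

def acronym(text):
    """First letters of the capitalized words, e.g. 'A Big Co' -> 'ABC'."""
    return "".join(word[0] for word in text.split() if word[0].isupper())

def is_alias(a, b):
    """Toy alias test (assumption): one phrase is the acronym of the other."""
    return a != b and (a == acronym(b) or b == acronym(a))

def pair_features(p1, p2):
    """Build one instance describing a candidate coreferent pair."""
    return {
        "name_1": p1.has_proper_name,                             # (5)
        "name_2": p2.has_proper_name,                             # (5)
        "jv_1": p1.is_joint_venture,                              # (6)
        "jv_2": p2.is_joint_venture,                              # (6)
        "jv_both": p1.is_joint_venture and p2.is_joint_venture,   # (6)
        "alias": is_alias(p1.text, p2.text),                      # (7)
        "same_base_np": p1.base_np.lower() == p2.base_np.lower(), # (8)
        "same_sentence": p1.sentence_index == p2.sentence_index,  # (9)
    }

first = Phrase("International Business Machines", "International Business Machines", 0, True, False)
second = Phrase("IBM", "IBM", 2, True, False)
print(pair_features(first, second))
```

Each such instance, labeled coreferent or not coreferent, would be one training example for the learner.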
Coreference Resolution (4). Test of MLR and RESOLVE, evaluated using 50 to 250 texts. RESOLVE: recall 80 to 85%, precision 87 to 92%; default (always negative): about 74%. MLR: recall 67 to 70%, precision 83 to 88%. Both significantly outperform manually developed IE systems.
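For readers unfamiliar with the measures, recall and precision over pairwise coreference decisions can be computed as in the sketch below. The labels are made-up toy data, not results from MLR or RESOLVE, and the baseline function assumes the "default" figure is read as the accuracy of always answering "not coreferent"; the function names are hypothetical.

```python
def precision_recall(gold, predicted):
    """gold/predicted: parallel lists of booleans (True = pair is coreferent)."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g and p)
    false_pos = sum(1 for g, p in zip(gold, predicted) if not g and p)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

def always_negative_accuracy(gold):
    """Accuracy of a baseline that labels every pair 'not coreferent'."""
    return sum(1 for g in gold if not g) / len(gold)

# Toy data: eight candidate pairs.
gold = [True, True, False, False, False, True, False, False]
predicted = [True, False, False, True, False, True, False, False]

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"always-negative accuracy={always_negative_accuracy(gold):.3f}")
```

Because most candidate pairs are not coreferent, an always-negative baseline can score a high accuracy while having zero recall, which is why both recall and precision are reported.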
Coreference Resolution (5). Much research remains to be done. The methods should be tested on additional types of anaphors, and without domain-specific information (?). The relative effect of errors from the preceding phases must be investigated. Few attempts have been made on other discourse-level problems.
Future Directions. Research in IE is very new; applying ML algorithms to it is even newer. A number of exciting directions: unsupervised learning to sidestep the lack of annotated corpora; how to eliminate the need for NLP experts when moving IE systems to other domains.