Presentation is loading. Please wait.

Presentation is loading. Please wait.

Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Information Extraction Feiyu Xu Jakub Piskorski DFKI LT-Lab.

Similar presentations


Presentation on theme: "Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Information Extraction Feiyu Xu Jakub Piskorski DFKI LT-Lab."— Presentation transcript:

1 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Information Extraction Feiyu Xu Jakub Piskorski DFKI LT-Lab

2 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Overview Introduction to information extraction SPPC Survey of information extraction approaches MUC conferences

3 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology What is Information Extraction The creation of a structured representation (TEMPLATE) of selected information drawn from natural language text Task is more limited than full text understanding: extracting relevant information, ignoring irrelevant information

4 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Example (Management Succession) 1.Dr. Hermann Wirth, bisheriger Leiter der Musikhochschule München, verabschiedete sich heute aus dem Amt. 2.Der 65jährige tritt seinen wohlverdienten Ruhestand an. 3.Als seine Nachfolgerin wurde Sabine Klinger benannt. 4.Ebenfalls neu besetzt wurde die Stelle des Musikdirektors. 5.Annelie Häfner folgt Christian Meindl nach. PersonOut: Dr. Hermann Wirth PersonIn: Sabine Klinger Position: Leiter Organization: Musikhochschule München TimeOut:Heute TimeIn: PersonOut: Christian Meindl PersonIn: Annelie Häfner Position: Musikdirektor Organization: ? TimeOut: TimeIn:

5 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Template filling rule NLP: VG: verabschiedet sich aus dem Amt Subject: [1] NP Domain: IN: OUT: PERSON: [1] PERSON: [2] ORGANISATION: [3] POSITION: [4] ORGANISATION: [3] POSITION: [4]

6 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Template filling rule NLP: VG: verabschiedete sich heute aus dem Amt Subject: [1] Domain:IN: OUT: PERSON: [1] PERSON: [2] ORGANISATION: [3] POSITION: [4] ORGANISATION: [3] POSITION: [4] PERSON_NAME: [5] POSITION_NAME: [4] ORGANISATION_NAME: [3]

7 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Traditional IE Architecture Tokenization Morphological and Lexical Processing Parsing Discourse Analysis Text Sectionizing and Filtering Part of Speech Tagging Word Sense Tagging Fragment Processing Fragment Combination Scenario Pattern Matching Coreference Resolution Inference Template Merging Local Text Analysis

8 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Introduction System for shallow analysis of German free text documents (toolset of integrated shallow components) based on shallow text processor of SMES (G. Neumann) exhaustive use of finite-state technology: - DFKI Finite-State Machine Toolkit - dynamic tries high linguistic coverage very good performance SPPC - SHALLOW PROCESSING PRODUCTION CENTER

9 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Architecture rund 60 bis 70 Prozent: percentage-NP bis: adv Steigerungsrate: steigerung+[s]+rate bis: prep|adv rund 60 bis 70 Prozent: NP der Steigerungsrate: NP ASCII Documents Tokenizer Lexical Processor POS-Filtering Named Entity Finder Phrase Recognizer Sentence Boundary Detection XML-Output Interface Linguistic Knowledge Pool XML Documents rund: lowercase 60: two-digit-integer EXAMPLE: rund 60 bis 70 Prozent der Steigerungsrate (about 60 to 70 percent increase) Text Chart

10 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Tokenizer The goal of the TOKENIZER is to: - map sequences of consecutive characters into word-like units (tokens) - identify the type of each token disregarding the context - performing word segmentation when necessary (e.g., splitting contractions into multiple tokens if necessary) overall more than 50 classes (proved to simplify processing on higher stages) - NUMBER_WORD_COMPOUND (69er) - ABBREVIATION ( CANDIDATE_FOR_ABBREVIATION), - COMPLEX_COMPOUND_FIRST_CAPITAL ( AT&T-Chief ) - COMPLEX_COMPOUND_FIRST_LOWER_DASH ( dItalia-Chefs-) represented as single WFSA (406 KB)

11 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Lexical Processor Tasks of the LEXICAL PROCESSOR: - retrieval of lexical information - recognition of compounds ( Autoradiozubehör - car-radio equipment ) - hyphen coordination ( Leder-, Glas-, Holz- und Kunstoffbranche leather, glass, wooden and synthetic materials industry) lexicon contains currently more than 700 000 German full-form words (tries) each reading represented as triple example: wagen ( to dare vs. a car ) STEM: wag INFL: (GENDER: m,CASE: nom, NUMBER: sg) (GENDER: m,CASE: akk, NUMBER: sg) (GENDER: m,CASE: dat, NUMBER: sg) (GENDER: m,CASE: nom, NUMBER: pl) (GENDER: m,CASE: akk, NUMBER: pl) (GENDER: m,CASE: dat, NUMBER: pl) (GENDER: m,CASE: gen, NUMBER: pl) POS: noun STEM: wag INFL: (FORM: infin) (TENSE: pres, PERSON: anrede, NUMBER: sg) (TENSE: pres, PERSON: anrede, NUMBER: pl) (TENSE: pres, PERSON: 1, NUMBER: pl) (TENSE: pres, PERSON: 3, NUMBER: pl) (TENSE: subjunct-1, PERSON: anrede, NUMBER: sg) (TENSE: subjunct-1, PERSON: anrede, NUMBER: pl) (TENSE: subjunct-1, PERSON: 1, NUMBER: pl) (TENSE: subjunct-1, PERSON: 3, NUMBER: pl) (FORM: imp, PERSON: anrede) POS: noun

12 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Part-of-Speech Filtering The task of POS FILTER is to filter out unplausible readings of ambiguous word forms large amount of German word forms are ambiguous (20% in test corpus) contextual filtering rules (ca. 100) - example: Sie bekannten, die bekannten Bilder gestohlen zu haben They confessed they have stolen the famous pictures bekannten - to confess vs. famous FILTERING RULE: if the previous word form is determiner and the next word form is a noun then filter out the verb reading of the current word form supplementary rules determined by Brills tagger in order to achieve broader coverage rules represented as FSTs, hard-coded rules (filtering out rare readings)

13 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Named Entity Recognition The task of NAMED ENTITY FINDER is the identification of: - entities: organizations, persons, locations - temporal expressions: time, date - quantities: monetary values, percentages, numbers identification of named entities in two steps: - recognition patterns expressed as WFSA are used to identify phrases containing potential candidates for named entities (longest match strategy) - additional constraints (depending on the type of candidate) are used for validating the candidates on-line base lexicon for geographical names, first names (persons)

14 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Named Entity Recognition since named entities may appear without designators (companies, persons) a dynamic lexicon for storing such named entities is used candidates for named entities: example: Da flüchten sich die einen ins Ausland, wie etwa der Münchner Strickwarenhersteller März GmbH oder der badische Strumpffabrikant Arlington Socks, GmbH. Ab kommendem Jahr strickt März knapp drei Viertel seiner Produktion in Ungarn. partial reference resolution (acronyms) resolution of type ambiguity using the dynamic lexicon: example: if an expression can be a person name or company name (Martin Marietta Corp.) then use type of last entry inserted into dynamic lexicon for making decision

15 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Phrase Recognition The task of FRAGMENT RECOGNIZER is to recognize nominal and prepositional phrases and verb groups example: [ NP Das Unternehmen] [ ORG Merck]] [ V geht] [ DATE im Herbst 1999] [ PP mit einem Viertel seines Kapitals] [ PP an die [ ORG Frankfurter Börse]] In autumn 1999 the company Merck will go public to the Frankfurt Stock Exchange with one fourth of its capital extraction patterns expressed as WFSAs (mainly based on POS information and named entity type) phrasal grammars only consider continuous substrings (recognition of verb groups is partial) example: Gestern [ ist ] der Linguist vom neuen Manager [ entlassen worden ]. Yesterday, the linguist [has] [been fired] by the new manager. Sentence Boundary Detection few simple contextual rules are sufficient since named entity recognition is performed earlier

16 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Evaluation & Performance RecallPrecision COMPOUND ANALYSIS:98.53%99.29% POS-FILTERING:74.50%96.36% NE-RECOGNITION:85% 95.77% FRAGMENTS (NPs,PPs):76.11%91.94% Performance : Evaluation : INPUT: corpus of German business magazine Wirtschaftswoche (1,2MB) TIME: ~15sec. ( ~13400 wrds/sec) on Pentium III, 850MHz, 256 RAM RECOGNIZED: 197118 tokens, 11575 named entities, 64839 Phrases Availability : C++ Library for UNIX, Windows 98/NT, Demo with GUI for Windows 98/NT

17 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Two Approaches to Building IE Systems [Appelt & Israel, 99] Knowledge Engineering Approach Grammars are constructed by hand Domain patterns are discovered by a human expert through introspection and inspection of a corpus Much laborious tuning Automatically Trainable Systems Use statistical methods when possible Learn rules from annotated corpora, e.g., –statistical name recognizer Learn rules from interaction with user, e.g., –learning template filler rules

18 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Knowledge Engineering [Appelt & Israel, 99] Advantages With skill and experience, good performing systems are not conceptually hard to develop The best performing systems have been hand crafted Disadvantages Very laborious development process Some changes to specifications can be hard to accommodate, e.g., –Name recognition rules based upper and lower cases Required experts may not be available

19 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Trainable Systems [Appelt & Israel, 99] Advantages Domain portability is relatively straightforward System experts is not required for customization Data driven rule acquisition ensures full coverage of examples Disadvantages Training data may not exist, and may be very expensive to acquire Large volume of training data may be required Changes to specifications may require reannotation of large quantities of training data

20 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology When Works Best for Knowledge based IE? [Appelt & Israel, 99] Resources (e.g. lexicons, lists) are available Rule writers are available Training data scarce or expensive to obtain A mixture of knowledge based and machine learning based approach is also possible!

21 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology FASTUS [Hobbs et al. 96] Acronym for Finite State Automaton Text Understanding System Developed in SRI International, Menlo Park, California Joined MUC-4 (92), MUC-5 (93), MUC-6 (95), MUC-7 (97) English and Japanese Inspired by the performance in MUC-3 that the group at the University of Massachusetts got out of a simple system [Lehnert et al., 1991] Pereiras work on finite-state approximations of grammars [Pereira, 1990] Works as a set of cascaded and nondeterministic finite-state automata

22 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Overview of the FASTUS Architecture [Hobbs et al. 96] Complex Words Text Basic Phrases Complex Phrases Merging Structures Domain Event Patterns Recognizing Phrases joint venture, set up, Bridgstone Sports Taiwan Co noun groups: a local concern verb groups: had set up appositive attachment: Peter, the director pp attachment: production of iron and wood

23 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Example of Phrase Recognition [Hobbs et al. 96] Salvadoran President-elect Afredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of crime Noun Group: Salvadoran President-elect Name: Alfredo Cristiani Verb Group:condemned Noun Group:the terrorist killing Preposition:of Noun Group:Attorney General Name:Roberto Gareia Alvarado Conjunction:and Verb Group:accused Noun Group:the Farabundo Marti National Liberation Front (FMLN) Preposition:of Noun Group:the crime

24 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Example of Domain Event Pattern Recognition [Hobbs et al. 96] [Salvadoran President-elect Afredo Cristiani] 2 condemned [the terrorist killing of Attorney General Roberto Garcia Alvarado] 1 and [accused the Farabundo Marti National Liberation Front (FMLN) of crime] 2 Two patterns are recognized: 1. killing of 2. accused of Two incident structures are constructed: Incident: KILLING Perpetrator:terrorist Confidence:__ Human Target: Roberto Garcia Alvarado Incident: KILLING Perpetrator:FMLN Confidence:Accused by Authorities Human Target: __

25 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology pseudo-syntax [Hobbs et al. 96] The material between the end of the subject noun group and the beginning of the main verb group must be read over, for example, Read over prepositional phrases and relative clauses 1.Subject {Preposition NounGroup}* VerbGroup 2.Subject Relpro {NounGroup | Other }* VerbGroup The mayor, who was kidnapped yesterday, was found dead today. Conjoined verb phrase, skipping over the first conjunct and associate the subject with the verb group in the second conjunct Subject VerbGroup {NounGroup|Other}* Conjunction VerbGroup

26 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Problem of pseudo-syntax [Hobbs et al. 96] Same semantic content can be realized in different forms. GM manufactures cars. Cars are manufactured by GM.... GM, which manufactures cars...... cars, which are manufactured by GM...... cars manufactured by GM... GM is to manufacture cars. Cars are to be manufactured by GM. GM is a car manufacturer. Question: How many rules are needed to extract all relevant patterns? Why not using a linguistic theory? Performance vs. Compotence TACITUS: 36 hours to process 100 Messages FASTUS: 12 minutes to process 100 messages

27 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Example of Template Merging [Hobbs et al. 96] Salvadoran President-elect Afredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of crime Two incident structures are merged as: Incident: KILLING Perpetrator:terrorist Confidence:__ Human Target: Roberto Garcia Alvarado Incident: KILLING Perpetrator:FMLN Confidence:Accused by Authorities Human Target: __ Incident: KILLING Perpetrator:FMLN Confidence:Accused by Authorities Human Target: Roberto Garcia Alvarado

28 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Message Understanding Conferences [MUC-7 98] U.S. Government sponsored conferences with the intention to coordinate multiple research groups seeking to improve IE and IR technologies (since 1987) defined several generic types of information extraction tasks (MUC Competition) MUC 1-2 focused on automated analysis of military messages containing textual information MUC 3-7 focused on information extraction from newswire articles terrorist events international joint-ventures management succession event

29 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Evaluation of IE systems in MUC Participants receive description of the scenario along with the annotated training corpus in order to adapt their systems to the new scenario (1 to 6 months) Participants receive new set of documents (test corpus) and use their systems to extract information from these documents and return the results to the conference organizer The results are compared to the manually filled set of templates (answer key)

30 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Evaluation of IE systems in MUC precision and recall measures were adopted from the information retrieval research community Sometimes an F-meassure is used as a combined recall-precision score

31 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Generic IE tasks for MUC-7 (NE) Named Entity Recognition Task requires the identification an classification of named entities organizations locations persons dates, times, percentages and monetary expressions (TE) Template Element Task requires the filling of small scale templates for specified classes of entities in the texts Attributes of entities are slot fills (identifying the entities beyond the name level) Example: Persons with slots such as name (plus name variants), title, nationality, description as supplied in the text, and subtype. Capitan Denis Gillespie, the comander of Carrier Air Wing 11

32 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Generic IE tasks for MUC-7 (TR) Template Relation Task requires filling a two slot template representing a binary relation with pointers to template elements standing in the relation, which were previously identified in the TE task subsidiary relationship between two companies (employee_of, product_of, location_of)

33 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology (CO) Coreference Resolution requires the identification of expressions in the text that refer to the same object, set or activity variant forms of name expressions definite noun phrases and their antecedents pronouns and their antecedents The U.K. satellite television broadcaster said its subscriber base grew 17.5 percent during the past year to 5.35 million bridge between NE task and TE task Generic IE tasks for MUC-7

34 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology (ST) Scenario Template requires filling a template structure with extracted information involving several relations or events of interest intended to be the MUC approximation to a real-world information extraction problem identification of partners, products, profits and capitalization of joint ventures Generic IE tasks for MUC-7

35 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Tasks evaluated in MUC 3-7 [Chinchor, 98] EVAL\TASKNECORETRST MUC-3YES MUC-4YES MUC-5YES MUC-6YES MUC-7YES

36 Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Maximum Results Reported in MUC-7 MEASSURE\TASKNECOTETRST RECALL9256866742 PRECISION9569878665


Download ppt "Source: Feiyu Xu, Jakub Piskorski 2002 Language Technology Information Extraction Feiyu Xu Jakub Piskorski DFKI LT-Lab."

Similar presentations


Ads by Google