Information Extraction: What It Is, How to Do It, Where It's Going
Douglas E. Appelt, Artificial Intelligence Center, SRI International

Some URLs to Visit
- ANLP-97 tutorial on information extraction
  - Many WWW links
    - Research sites and literature
    - Resources for building systems
- An IE system for Power PC Macintoshes
  - Uses TIPSTER technology
    - TIPSTER architecture
    - Common Pattern Specification Language
    - It's free
  - Comes with a complete English name recognizer

Information Extraction: Situating IE
- Text manipulation: grep
- Information retrieval
- Information extraction
- Text understanding

Text Understanding
- No predetermined specification of semantic or communicative areas of interest
- No clearly defined criteria of success
- The representation of meaning must be general enough to capture all of the meaning of the text and the author's intentions

Information Extraction
- Information of interest is delimited and pre-specified
- Fixed, predefined representation of information
- Clear criteria of success are at least possible
- Corollary features
  - Only a small portion of the text is relevant
  - Often, only a portion of a relevant sentence is relevant
  - Targeted at relatively large corpora

Applications
- Information retrieval (routing queries)
- Indexing for information retrieval
- Filtering IR output
- Direct presentation to the user: highlighting
- Summarization
- Construction of databases and knowledge bases

Evaluation Metrics
- MUC evaluations
- Precision and recall
  - Recall: percentage of the possible answers that were found
  - Precision: percentage of the answers provided that are correct
  - F-measure: weighted harmonic mean of recall and precision
- Is there an F-60 barrier?
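The F-measure used in MUC scoring can be sketched in a few lines; the example numbers below are illustrative, not from any MUC evaluation:

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 favors precision.
    beta = 1 gives the familiar balanced F1.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# A hypothetical system that finds 60% of the answers, with 70% of
# its answers correct, scores just above the "F-60 barrier":
print(round(100 * f_measure(0.70, 0.60), 1))  # 64.6
```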

A Bare-Bones Extraction System
Tokenizer → Morphological and Lexical Processing → Parsing → Domain Semantics

Flesh for the Bones
The bare-bones pipeline (Tokenizer → Morphological and Lexical Processing → Parsing → Domain Semantics) is augmented with:
- Text sectionizing and filtering
- Part-of-speech and word sense tagging
- Coreference
- Merging of partial results
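The bare-bones pipeline can be sketched as a chain of stage functions. Every stage implementation below is a hypothetical stand-in for illustration, not the code of any actual system:

```python
import re

def tokenizer(text):
    # Crude word/punctuation split; real tokenizers handle abbreviations, etc.
    return re.findall(r"\w+|[^\w\s]", text)

def lexical_processing(tokens):
    # Attach one trivial lexical feature: whether the token is capitalized.
    return [(t, {"cap": t[:1].isupper()}) for t in tokens]

def parsing(lex_items):
    # Placeholder parser: group maximal runs of capitalized tokens.
    groups, current = [], []
    for tok, feats in lex_items:
        if feats["cap"]:
            current.append(tok)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

def domain_semantics(groups):
    # Placeholder semantics: emit one "entity" record per group.
    return [{"entity": " ".join(g)} for g in groups]

def extract(text):
    # Tokenizer -> Morphological/Lexical Processing -> Parsing -> Domain Semantics
    return domain_semantics(parsing(lexical_processing(tokenizer(text))))

print(extract("Yesterday IBM announced a merger with Lotus Development Corp."))
```

Note that the toy parser wrongly groups "Yesterday IBM" into one name, which is exactly the kind of capitalization gotcha discussed later in the talk.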

The IE Approach: KISS
- Keep It Simple, Stupid
  - Finite-state language models
  - Fragment processing
  - Simple semantics
    - Propositional
    - Small number of propositions
    - Often represented by templates
- Use heuristics
  - Missing information
  - Make favorable recall/precision tradeoffs

Two Approaches to Extraction Systems
- Knowledge engineering approach
  - Grammars constructed by hand
  - Domain patterns discovered by introspection and corpus examination
  - Laborious tuning and hill-climbing
- Learning and statistical approach
  - Apply statistical methods where possible
  - Learn rules from annotated corpora
  - Learn from interaction with the user

Knowledge Engineering Approach
- Advantages
  - Skilled computational linguists can build good systems quickly
  - The best-performing practical systems have so far been handcrafted
- Disadvantages
  - Very laborious development process
  - Difficult to port systems to new domains
  - Requires expertise

Learning/Statistical Approach
- Advantages
  - Domain portability is straightforward
  - Minimal expertise is required for customization
  - Rule acquisition is data-driven, giving complete coverage of the examples
- Disadvantages
  - Training data may not exist and may be difficult or expensive to obtain
  - The highest-performing systems are still handcrafted

A Combined Approach
- Use statistical methods on modules where training data exists and high accuracy can be achieved
  - Part-of-speech tagging
  - Name recognition
  - Coreference
- Use knowledge engineering where training data is sparse and human ingenuity is required
  - Domain processing

Lexical Processing: Named Entity Recognition
- Named entities are targets of extraction in many domains
  - Companies
  - Other organizations
  - People
  - Locations
  - Dates, times, currency
- It is impossible or impractical to list all possible named entities in a lexicon

The List Fallacy
- Comprehensive lexical resources do not necessarily improve extraction performance
  - Some entities are so new that they appear on no list
  - Rare senses cause problems: "has-been" as a noun
  - Names often overlap with other names and with ordinary words
    - "Dallas" can be the name of a person
    - "Dollar" is the name of a town
- Solutions
  - Part-of-speech tagging
  - Recognition from context

Knowledge Engineering vs. Statistical Models
- Knowledge engineering
  - SRI, SRA, Isoquest
  - Performance
    - 1996: F
    - 1998: F
- Statistical models
  - BBN, NYU (1998)
  - Performance
    - 1997: F 93
    - 1998: F
- Hand-coding reduces the error rate by 50%.

Knowledge Engineering Name Recognition
- Identify some names explicitly in the lexicon
- Identify parts of names with lexical features
- Write rules that recognize names
  - Use capitalization in English
  - Recognize names by their internal structure
    - "Mumble Mumble City" → Location
    - "Mumble Mumble GmbH" → Company
  - Exceptions for common "gotchas"
    - "Yesterday IBM announced…"
    - "General Electric" is a company, not a general
- The result is many complex rules
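Rules of this kind can be sketched as follows. The suffix and trap-word lists are invented for illustration; a real hand-crafted system has many more rules and exceptions:

```python
# Illustrative word lists -- not the rules of any actual system.
COMPANY_SUFFIXES = {"GmbH", "Inc.", "Corp.", "Co.", "Ltd."}
LOCATION_SUFFIXES = {"City", "County", "River"}
SENTENCE_INITIAL_TRAPS = {"Yesterday", "The", "In"}  # capitalized non-names

def classify_name(phrase: str) -> str:
    """Classify a capitalized phrase by its internal structure."""
    words = phrase.split()
    if not words or not all(w[:1].isupper() for w in words):
        return "not-a-name"
    if len(words) == 1 and words[0] in SENTENCE_INITIAL_TRAPS:
        return "not-a-name"           # the "Yesterday IBM announced..." trap
    if words[-1] in COMPANY_SUFFIXES:
        return "company"              # "Mumble Mumble GmbH" -> company
    if words[-1] in LOCATION_SUFFIXES:
        return "location"             # "Mumble Mumble City" -> location
    return "name"                     # capitalized, but type unresolved

print(classify_name("Information Resources, Inc."))  # company
```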

Statistical Model Name Recognition
A hidden Markov model: Start → {Name, Not-a-name} → End, with transitions between the Name and Not-a-name states.

Statistical Model Name Recognition
- Transitions are probabilistic
- Training
  - Annotate a corpus
  - Estimate transition probabilities given words (and/or their features): P(s_i | s_{i-1}, w_i)
- Application
  - Compute the maximum-likelihood path through the network for the input text
  - The Viterbi algorithm
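A minimal Viterbi decoder for the two-state name/not-a-name HMM can be sketched as below. The toy probabilities P(s_i | s_{i-1}, w_i) are invented for illustration; a real system estimates them from an annotated corpus:

```python
import math

STATES = ("name", "other")

def transition(prev, state, word):
    """Toy P(state | prev, word): capitalized words favor the name state."""
    p_name = 0.8 if word[:1].isupper() else 0.1
    if prev == "name":                 # names tend to continue
        p_name = min(0.9, p_name + 0.1)
    return p_name if state == "name" else 1.0 - p_name

def viterbi(words):
    # delta[s] = best log-probability of any path ending in state s
    delta = {s: math.log(transition("start", s, words[0])) for s in STATES}
    back = []
    for w in words[1:]:
        step, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: delta[p] + math.log(transition(p, s, w)))
            step[s] = delta[best] + math.log(transition(best, s, w))
            ptr[s] = best
        delta = step
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=delta.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

print(viterbi("Garrick joined Information Resources yesterday".split()))
```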

Training Data
- The amount of data needed is not onerous (diminishing returns at about 100,000 words)
- Annotation can be done by non-linguist native speakers
- Training also works (with some degradation) for upper-case-only and punctuation-free texts

Interesting Aside
- NYU trained a statistical model using as word "features" whether various other name recognition systems tagged each word as part of a name
- Result: better than human performance!
  - The system achieved F
  - Experienced humans: F

Parsing in IE Systems
- Some IE systems have attempted full parsing
  - NYU's pre-1996 Proteus system
  - SRI's TACITUS system
- Attempts to adapt to the IE task
  - Fragment interpretation
  - Limitation of search
- Statistical parsing?
  - No real systems exist yet

Problems with Full Parsing
- The search space becomes prohibitively large for long sentences
  - The system is slow; rapid development and testing of rules becomes impossible
- The "full parse" heuristic
  - With a comprehensive grammar, it is often possible to span the sentence with a highly improbable parse when the actual analysis is outside the grammar or lost in the search space

The IE Approach to Parsing
- Analyze sentences as simple constituents that can be described with a finite-state grammar
  - Noun groups, verb groups, particles
  - Ignore prepositional attachment
  - Ignore clause boundaries
- The parser consists of one or more finite-state transducers mapping words into simple constituents

A Finite-State Fragment Parse
[A. C. Nielson Co.]NG [said]VG [George Garrick,]NG [40 years old, president]NG [of Information Resources, Inc.]NG ['s London-based European Information Services operation]NG [will become]VG [president]NG [and chief operating officer]NG [of Nielson Marketing Research USA]NG [a unit]NG [of Dun & Bradstreet.]NG
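A fragment chunker of this kind can be sketched as a one-pass finite-state machine over part-of-speech tags. The tiny tag dictionary below is invented for the example, and real noun/verb-group grammars are considerably richer:

```python
# Illustrative tag dictionary -- unknown capitalized words default to NNP.
TAGS = {
    "the": "DT", "a": "DT", "will": "MD", "become": "VB", "said": "VBD",
    "president": "NN", "officer": "NN", "unit": "NN",
    "chief": "JJ", "operating": "JJ", "and": "CC", "of": "IN",
}

def tag(word):
    if word in TAGS:
        return TAGS[word]
    return "NNP" if word[:1].isupper() else "NN"

def chunk(words):
    """One-pass finite-state chunker: runs of noun-ish tags become NG,
    runs of verb-ish tags become VG; everything else is left bare."""
    NOUNISH = {"DT", "JJ", "NN", "NNP"}
    VERBISH = {"MD", "VB", "VBD"}
    out, group, label = [], [], None

    def flush():
        nonlocal group, label
        if group:
            out.append((" ".join(group), label))
            group, label = [], None

    for w in words:
        t = tag(w)
        kind = "NG" if t in NOUNISH else "VG" if t in VERBISH else None
        if kind != label:
            flush()
        if kind is None:
            out.append((w, None))
        else:
            label = kind
            group.append(w)
    flush()
    return out

print(chunk("George Garrick will become president and chief operating officer".split()))
```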

Handling Difficult Cases
- Relative clauses
  - Use nondeterminism to connect a single subject to multiple clauses
- VP conjunction
  - Use nondeterminism to connect a single subject to multiple verb phrases
- Appositives
  - Handle only domain-relevant cases
- Prepositional attachment
  - Handle only domain-relevant cases

An Application Domain
- Identify domain-relevant objects
- Identify properties of those objects
- Identify relationships among domain-relevant objects
- Identify relevant events involving domain objects

The Molecular Approach
- The standard approach
- A high-precision, low-recall approach
  - Read texts
  - Identify common, domain-relevant patterns signaling properties, events, and relationships
  - Build rules to cover those patterns
  - Move on to less frequent, less reliable patterns

The Atomic Approach
- Aims for high recall, low precision
  - Determine features of application-relevant entity types
  - Determine features of application-relevant event and relation types
  - Every occurrence of a phrase with a relevant feature triggers a candidate event/relation
  - Merge candidate relations to obtain more fully instantiated event/relation descriptions
  - Filter using application-specific criteria
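The merging step above can be sketched as unification of partial templates: two candidates combine when their filled slots do not conflict. The slot names and the greedy first-fit strategy are illustrative assumptions, not a description of any particular system:

```python
def compatible(a, b):
    """Two partial templates unify if their shared slots agree."""
    return all(a[k] == b[k] for k in a.keys() & b.keys())

def merge_templates(templates):
    """Greedily fold each partial template into the first compatible one."""
    merged = []
    for t in templates:
        for m in merged:
            if compatible(m, t):
                m.update(t)
                break
        else:
            merged.append(dict(t))
    return merged

# Three partial descriptions of one hypothetical succession event:
candidates = [
    {"event": "succession", "person": "George Garrick"},
    {"event": "succession", "post": "president"},
    {"event": "succession", "person": "George Garrick",
     "org": "Nielson Marketing Research"},
]
print(merge_templates(candidates))  # one fully instantiated template
```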

Appropriateness
- Appropriate when
  - Relevant entities have easily determined types
  - Only one or a small number of relations can hold of an entity with a given type
  - Relevant events and relations are symmetrical
- Examples
  - Labor negotiations
  - MUC-5 microelectronics
- Relies heavily on merging of partial information (even within a sentence)

Is There a Barrier?

Where Is the Upper Bound?
- Experience suggests that, for a MUC-like task with MUC scoring, it is unrealistic to expect more than about F-65 on a blind test (F-70 on training data)
- That is about 75% of human performance

Reasons for the Limits
- There is a long tail of increasingly rare domain-relevant expressions
- A barrier of inherently hard linguistic phenomena
  - Complex coordination
  - Collective/distributive reference
  - Multiple interacting phenomena in the same sentence
  - Hard inferences are required
- The limits of heuristic tradeoffs are reached

Improving Information Retrieval
- Routing task:
  - Build a quick extraction system for a topic
  - The IR system picks 2,000 texts
  - Rescore by using the extraction system to evaluate each text for relevance
  - Return the top 1,000 texts
- Results: 12 queries improved, 4 unchanged, 5 worse
- Best results when training data is sparse
- More testing and evaluation is needed

Topic-Oriented Summarization
- Extract information of interest
- Generate a natural-language summary of the extracted data
- Generation can be in a different language, enabling cross-language access to key information

Processing Many Documents Quickly
- Exploit redundancy in corpora to get higher recall by merging multiple descriptions of the same event
  - Analyze data from multiple news feeds
- Annotating text for training language models
  - Need to identify names in speech (broadcast news)
  - Train a class bigram on 100 million words of training data
  - Because automatic name annotation is almost as good as human annotation, automatic annotation of training data is feasible

Making the Limits More Quickly Attainable
- Automatic learning of rules from examples
- Application of "open-domain" extraction systems
  - Build general rules for a very broad domain, like "business and economic news"
  - Quickly customize rules from the library for a specific application
  - A prototype was used to generate extraction systems for routing queries in half a day