©2003 Paula Matuszek CSC 9010: Information Extraction Dr. Paula Matuszek (610) 270-6851 Fall, 2003.

Slides:



Advertisements
Similar presentations
Tag-Questions or Question Tags
Advertisements

YES / NO QUESTIONS Pamela Sue Rohring EDD 537 – Language Theories and Strategies II February 22, 2001.
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
Grammar and Sentences “It is impossible ..to teach English grammar in the schools for the simple reason that no one knows exactly what it is” Government.
A Machine Learning Approach to Coreference Resolution of Noun Phrases By W.M.Soon, H.T.Ng, D.C.Y.Lim Presented by Iman Sen.
Learning Semantic Information Extraction Rules from News The Dutch-Belgian Database Day 2013 (DBDBD 2013) Frederik Hogenboom Erasmus.
Dialogue – Driven Intranet Search Suma Adindla School of Computer Science & Electronic Engineering 8th LANGUAGE & COMPUTATION DAY 2009.
©2012 Paula Matuszek CSC 9010: Text Mining Applications: Text Features Dr. Paula Matuszek (610)
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
NaLIX: A Generic Natural Language Search Environment for XML Data Presented by: Erik Mathisen 02/12/2008.
CSC 4630 Meeting 9 February 14, 2007 Valentine’s Day; Snow Day.
CS4705.  Idea: ‘extract’ or tag particular types of information from arbitrary text or transcribed speech.
Empirical Methods in Information Extraction - Claire Cardie 자연어처리연구실 한 경 수
Domain model: visualizing concepts
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.
Grammar Nuha Alwadaani.
NATURAL LANGUAGE TOOLKIT(NLTK) April Corbet. Overview 1. What is NLTK? 2. NLTK Basic Functionalities 3. Part of Speech Tagging 4. Chunking and Trees 5.
Chapter 9 Domain Models. Domain Model in UML Class Diagram Notation A “visual dictionary”
Grammar and Composition Review
Grammar Skills Workshop
Welcome Orientation. Introduction to the Course Course Objectives By the end of this course students will be able to: · Master the grammatical uses and.
COMPOSITION 9 Parts of Speech: Verbs Action Verbs in General  Follow along on Text page 362.  A verb either expresses an action (what something or.
Survey of Semantic Annotation Platforms
Information Extraction From Medical Records by Alexander Barsky.
©2002 Paula Matuszek iMiner Introduction. ©2002 Paula Matuszek iMiner from IBM l Text Mining tool with multiple components l Text Analysis tools includ.
Using Text Mining and Natural Language Processing for Health Care Claims Processing Cihan ÜNAL
Scott Duvall, Brett South, Stéphane Meystre A Hands-on Introduction to Natural Language Processing in Healthcare Annotation as a Central Task for Development.
Language Learning Targets based on CLIMB standards.
©2003 Paula Matuszek CSC 9010: Text Mining Applications Document Summarization Dr. Paula Matuszek (610)
Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering Mark A. Greenwood and Robert Gaizauskas Natural Language.
CS370 Spring 2007 CS 370 Database Systems Lecture 1 Overview of Database Systems.
Extract Questions from Sentences. Purpose The behavior of extracting questions from sentences can be regarded as extracting semantics or knowledge from.
GTRI.ppt-1 NLP Technology Applied to e-discovery Bill Underwood Principal Research Scientist “The Current Status and.
©2003 Paula Matuszek Taken primarily from a presentation by Lin Lin. CSC 9010: Text Mining Applications.
BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics.
Washington's Farewell Address
Domain Model—Part 2: Attributes.  A logical data value of an object  (Text, p. 158)  In a domain model, attributes and their data types should be simple,
©2003 Paula Matuszek CSC 9010: Text Mining Applications Dr. Paula Matuszek (610)
For Monday Read chapter 24, sections 1-3 Homework: –Chapter 23, exercise 8.
For Monday Read chapter 26 Last Homework –Chapter 23, exercise 7.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
Lexico-semantic Patterns for Information Extraction from Text The International Conference on Operations Research 2013 (OR 2013) Frederik Hogenboom
Blog Summarization We have built a blog summarization system to assist people in getting opinions from the blogs. After identifying topic-relevant sentences,
CSC 9010 Spring, Paula Matuszek. 1 CS 9010: Semantic Web Applications and Ontology Engineering Paula Matuszek Spring, 2006.
MedKAT Medical Knowledge Analysis Tool December 2009.
Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering Mark A. Greenwood and Robert Gaizauskas Natural Language.
For Friday Finish chapter 23 Homework –Chapter 23, exercise 15.
4. Relationship Extraction Part 4 of Information Extraction Sunita Sarawagi 9/7/2012CS 652, Peter Lindes1.
Domain Model A representation of real-world conceptual classes in a problem domain. The core of object-oriented analysis They are NOT software objects.
Macmillan IELTS Improving students' answers in IELTS Writing Task 1 Sam McCarter.
Subject-Verb Agreement & Parallel Structure
©2012 Paula Matuszek CSC 9010: Information Extraction Overview Dr. Paula Matuszek (610) Spring, 2012.
©2012 Paula Matuszek GATE and ANNIE Information taken primarily from the GATE user manual, gate.ac.uk/sale/tao, and GATE training materials,
For Monday Read chapter 26 Homework: –Chapter 23, exercises 8 and 9.
OO DomainModeling With UML Class Diagrams and CRC Cards Chapter 6 Princess Nourah bint Abdulrahman University College of Computer and Information Sciences.
Jean-Yves Le Meur - CERN Geneva Switzerland - GL'99 Conference 1.
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
DOMAIN MODEL—PART 2: ATTRIBUTES BTS430 Systems Analysis and Design using UML.
Auxiliaries in simple past How to work with “did” and “was-were”
©2003 Paula Matuszek CSC 9010: AeroText, Ontologies, AeroDAML Dr. Paula Matuszek (610)
An Ontology-based Automatic Semantic Annotation Approach for Patent Document Retrieval in Product Innovation Design Feng Wang, Lanfen Lin, Zhou Yang College.
G.L. 4 - Action Verbs Action verbs tell what the subject is or does. A sentence can contain more than one action verb. Action verbs make writing more vivid.
OO Domain Modeling With UML Class Diagrams and CRC Cards
Syntax.
Parts of Speech Review Commas
Parts of Speech Review Commas
©2004 Pearson Education, Inc., publishing as Longman Publishers.
PREPOSITIONAL PHRASES
Washington's Farewell Address
Information Retrieval
Presentation transcript:

©2003 Paula Matuszek CSC 9010: Information Extraction Dr. Paula Matuszek (610) Fall, 2003

©2003 Paula Matuszek Information Extraction Overview l Given a body of text: extract from it some well-defined set of information l MUC conferences l Typically draws heavily on NLP l Three main components: –Domain knowledge base –Knowledge model –Extraction Engine

©2003 Paula Matuszek Information Extraction Domain Knowledge Base l Terms: enumerated list of strings which are all members of some class. –“January”, “February” –“Smith”, “Wong”, “Martinez”, “Matuszek” –“”lysine”, “alanine”, “cysteine” l Classes: general categories of terms –Monthnames, Last Names, Amino acids –Capitalized nouns –Verb Phrases

©2003 Paula Matuszek Domain Knowledge Base l Rules: LHS, RHS, salience l Left Hand Side (LHS): a pattern to be matched, written as relationships among terms and classes l Right Hand Side (RHS): an action to be taken when the pattern is found l Salience: priority of this rule (weight, strength, confidence)

©2003 Paula Matuszek Some Rule Examples: l => l => print “Birthdate”,, l => create address database record l “/” “/” => create date database record (50) l “/” “/” => create date database record (60) l “.” => l => create “relationship” database record

©2003 Paula Matuszek Generic KB l Generic KB: KB likely to be useful in many domains –names –dates –places –organizations l Almost all systems have one l Limited by cost of development: it takes about 200 rules to define dates reasonably well, for instance.

©2003 Paula Matuszek Domain-specific KB l We mostly can’t afford to build a KB for the entire world. l However, most applications are fairly domain-specific. l Therefore we build domain-specific KBs which identify the kind of information we are interested in. –Protein-protein interactions –airline flights –terrorist activities

©2003 Paula Matuszek Domain-specific KBs l Typically start with the generic KBs l Add terminology l Figure out what kinds of information you want to extract l Add rules to identify it l Test against documents which have been human-scored to determine precision and recall for individual items.

©2003 Paula Matuszek Knowledge Model l We aren’t looking for documents, we are looking for information. What information? l Typically we have a knowledge model or schema which identifies the information components we want and their relationship l Typically looks very much like a DB schema or object definition

©2003 Paula Matuszek Knowledge Model Examples l Personal records –Name –First name –Middle Initial –Last Name –Birthdate –Month –Day –Year –Address

©2003 Paula Matuszek Knowledge Model Examples l Protein Inhibitors –Protein name (class?) –Compound name (class?) –Pointer to source –Cache of text –Offset into text

©2003 Paula Matuszek Knowledge Model Examples l Airline Flight Record –Airline –Flight l Number l Origin l Destination l Date »Status »departure time »arrival time

©2003 Paula Matuszek Extraction Engine l Tool which applies rules to text and extracts matches –Tokenizer (no stemming or stop words) –Part of Speech (POS) Tagger –Term and class tagger –Rule engine: match LHS, execute RHS l Rule engine is iterative l Rule conflict resolution –Salience –Packages

©2003 Paula Matuszek Extraction Example: Birthdates l Problem: create a database of birthdays from text with birth information l Sample sentences: –George Washington was born in –Washington was born on Feb. 12, –Feb. 12 is Washington's birthday. –Washington's birth date is Feb. 12, –George Washington was born in America. –Washington's standard was born by his troops in Examples Negative Examples

©2003 Paula Matuszek Birthdates: Knowledge Model l Simple birthdate model: –Name –Birthdate l Complex birthdate model: –Name– Date –First Name – Month –Middle Name – Day –Last Name – Year

©2003 Paula Matuszek Birthdates Knowledge Base l Generic KB: Name, Date l Domain specific KB: Rules –1. "was born" {"in"|"on"} =>Insert (Name, Date) into database –2. "is" "birthday" =>Insert (Name, Date) into database –3. "birth" "date" "is" =>Insert (Name, Date) into database

©2003 Paula Matuszek Birthdays: Extraction Process Washington was born in 1725 l Tokenize: –"Washington" –"was" –"born" –"in" –"1725" –"."

©2003 Paula Matuszek Extraction, POS Tagging l "Washington", noun, proper noun, subject l "was": auxiliary verb, past tense, third person singular (3PS) l "born": verb, past tense, 3PS l "was born": verb phrase, passive l "in": preposition l "1725": prepositional object l "in 1725" prepositional phrase

©2003 Paula Matuszek Extraction, Class Tagging l "Washington": Last Name l "was": nothing additional l "born": nothing additional l "in": nothing additional l "1725": Year

©2003 Paula Matuszek Extraction: Rules l Name Rules: –"Washington": Name l Date Rules: –"1725": Date l Birthday Rule # 1: –Insert (Washington, 1725) into database

©2003 Paula Matuszek Summary l Text mining below the document level l NOT typically interactive, because it’s slow (1 to 100 meg of text/hr) l Typically builds up a DB of information which can then be queried l Uses a combination of term- and rule- driven analysis and Natural Language Processing parsing.