Introduction to Information Extraction


Introduction to Information Extraction Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan chia@csie.ncu.edu.tw

Problem Definition
* Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form.
* The output template of the IE task
  * Several fields (slots)
  * Several instances of a field
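
As a rough illustration of an output template with several fields (slots) and several instances per field, here is a minimal Python sketch. The class name `Template` and the slot names are hypothetical and chosen only for this example; they are not part of any MUC specification.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """One filled output template: each field (slot) holds zero or more instances."""
    slots: dict[str, list[str]] = field(default_factory=dict)

    def add(self, slot: str, value: str) -> None:
        # A slot may be filled several times, so values accumulate in a list.
        self.slots.setdefault(slot, []).append(value)

    def get(self, slot: str) -> list[str]:
        # An unfilled slot yields an empty list rather than an error.
        return self.slots.get(slot, [])


# Hypothetical slot names, for illustration only.
template = Template()
template.add("person", "John Rollwagen")
template.add("position", "chairman")
template.add("position", "CEO")

print(template.get("position"))
print(template.get("location"))
```

Storing a list per slot keeps "no instance", "one instance", and "many instances" in a single uniform shape, which is one simple way to model the template described above.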

The Difficulty of IE Tasks Depends on …
* Text type: from Wall Street Journal articles or email messages to HTML documents.
* Domain: from financial news or tourist information to various languages.
* Scenario

Various IE Tasks
* Free-text IE:
  * For MUC (Message Understanding Conference)
  * E.g. terrorist activities, corporate joint ventures
* Semi-structured IE:
  * E.g. meta-search engines, shopping agents, bio-integration systems

Types of IE from MUC
* Named Entity recognition (NE): finds and classifies names, places, etc.
* Coreference Resolution (CO): identifies identity relations between entities in texts.
* Template Element construction (TE): adds descriptive information to NE results.
* Scenario Template production (ST): fits TE results into specified event scenarios.

Named Entity Recognition
http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book_3.html

NE Recognition (Cont.)
* Spanish: 93%
* Japanese: 92%
* Chinese: 84.51%

Coreference Resolution
* Coreference resolution (CO) involves identifying identity relations between entities in texts.
* For example, in "Alas, poor Yorick, I knew him well," CO ties "Yorick" to "him".
* The Sheffield system scored 51% recall and 71% precision.
* http://www.cs.nyu.edu/cs/faculty/grishman/COtask21.book_4.html

Template Element Production
* Adds descriptive information to named entities.
* The Sheffield system scored 71%.

Scenario Template Extraction
* STs are the prototypical outputs of IE systems.
* They tie together TE entities into event and relation descriptions.
* Performance for Sheffield: 49%
* http://www.cs.nyu.edu/cs/faculty/grishman/IEtask15.book_2.html

Example
User interests center on operational domains such as drug enforcement, money laundering, organized crime, and terrorism.
1. Input: texts dealing with drug enforcement, money laundering, organized crime, terrorism, and legislation;
2. NE: recognizes entities in those texts and assigns them to one of a number of categories drawn from the set of entities of interest (person, company, ...);
3. TE: associates certain types of descriptive information with these entities, e.g. the location of companies;
4. ST: identifies a set (relatively small to begin with) of events of interest by tying entities together into event relations.
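
The NE, TE, and ST steps can be pictured as a small pipeline. The sketch below is a toy under loud assumptions: the sentence, the gazetteer, the "based in" pattern, and the "acquired" event are all invented for illustration, and real MUC systems rely on far richer linguistic processing than string matching.

```python
import re

# Hypothetical gazetteer: entity string -> category (assumption, not from the slides).
GAZETTEER = {"Acme Corp.": "company", "Widget Inc.": "company", "Boston": "location"}


def named_entities(text):
    """NE: find known names in the text and assign each a category."""
    found = [(text.find(name), name, cat) for name, cat in GAZETTEER.items() if name in text]
    return [(name, cat) for _, name, cat in sorted(found)]


def template_elements(text, entities):
    """TE: attach descriptive information (here, a location) to company entities."""
    elements = {name: {"type": cat} for name, cat in entities}
    for name, cat in entities:
        if cat != "company":
            continue
        match = re.search(re.escape(name) + r", based in (\w+)", text)
        if match and match.group(1) in elements:
            elements[name]["location"] = match.group(1)
    return elements


def scenario_templates(text, elements):
    """ST: tie entities together into event relations (here, acquisitions)."""
    companies = [n for n, e in elements.items() if e["type"] == "company"]
    events = []
    for target in companies:
        pos = text.find(" acquired " + target)
        if pos == -1:
            continue
        # Take the nearest company mentioned before the verb as the buyer.
        buyers = [c for c in companies if 0 <= text.find(c) < pos]
        if buyers:
            buyer = max(buyers, key=text.find)
            events.append({"event": "acquisition", "buyer": buyer, "target": target})
    return events


text = "Acme Corp., based in Boston, acquired Widget Inc."
entities = named_entities(text)
elements = template_elements(text, entities)
events = scenario_templates(text, elements)
print(entities)
print(elements)
print(events)
```

Each stage consumes the previous stage's output, which mirrors how TE builds on NE results and ST builds on TE results.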

Example Text

Output Example (NE, TE)

Output (STs)

Another IE Example
Corporate Management Changes
Purpose:
* Which positions in which organizations are changing hands?
* Who is leaving a position, and where is the person going?
* Who is appointed to a position, and where is the person coming from?
* The locations and types of the organizations involved in the succession events
* The names and titles of the persons involved in the succession events
http://www.cs.umanitoba.ca/~lindek/ie-ex.htm

Input Text
President Clinton nominated John Rollwagen, the chairman and CEO of Cray Research Inc., as the No. 2 Commerce Department official. Mr. Rollwagen said he wants to push the Clinton administration to aggressively confront U.S. trading partners such as Japan to open their markets, particularly for high-tech industries. In a letter sent throughout the Eagan, Minn.-based company on Friday, Mr. Rollwagen warned: "Whether we like it or not, our country is in an economic war; and we are at a key turning point in that war." ...... Cray said it has appointed John F. Carlson, its president and chief operating officer, to succeed him. ......

Extraction Result

Corporate Management Database
Person | Organization | Position | Transition
John Rollwagen | Cray Research Inc. | chairman | out
 | | CEO |
John F. Carlson | | | in

Organization Database
Name | Location | Alias | Type
Cray Research Inc. | Eagan, Minn. | Cray | COMPANY
Commerce Department | | | GOVERNMENT

MUC Data Set
* For MET2: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/met2/met2package.tar.gz
* MUC3&4: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_data/muc34.tar.gz
* MUC6&7 from LDC: http://www.ldc.upenn.edu/
* MUC-6: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
* MUC-7: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html

Summary
* Evaluation
  * Precision = (# of correctly extracted fields) / (# of extracted fields)
  * Recall = (# of correctly extracted fields) / (# of fields to be extracted)
* Design Methodology
  * Natural Language Processing
  * Machine Learning
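
The two evaluation measures can be computed directly from a set of extracted fields and an answer key. The following is a minimal sketch; the function name, the (slot, value) representation, and the choice to return 0.0 when a denominator is empty are assumptions made for this example, and the sample data is invented.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted fields against the answer key."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)  # number of correctly extracted fields
    # Assumption: define a measure as 0.0 when its denominator is empty.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example: fields are (slot, value) pairs.
gold = {
    ("person", "John Rollwagen"),
    ("position", "chairman"),
    ("position", "CEO"),
    ("organization", "Cray Research Inc."),
}
extracted = {
    ("person", "John Rollwagen"),
    ("position", "chairman"),
    ("organization", "Cray"),  # wrong value, so it counts against precision
}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three extracted fields are correct and two of the four expected fields are found, so precision is 2/3 and recall is 2/4.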

IE from Semi-structured Documents
* Output Template: k-tuple
* Multiple instances of a field
* Missing data

Various IE Tasks for Semi-structured Documents
* Multiple-record page extraction
* One-record (singular) page extraction

Multiple-record page extraction

One-record (singular) page extraction

Summary
* Evaluation
  * Precision = (# of correctly extracted records) / (# of extracted records)
  * Recall = (# of correctly extracted records) / (# of records to be extracted)
* Design Methodology
  * Machine Learning
  * Pattern Mining

News Group IE Example: Computer-Related Jobs

Output Template Between free-text IE and semi-structured IE [CaliffRapier 99]

Annotated Training Examples
* Most systems require annotated training examples (answer keys)
  * AutoSlog, Rapier, SRV, WIEN, SoftMealy, Stalker
* Very few systems work from unannotated training examples
  * AutoSlog-TS, IEPAD, OLERA

The Type of Extraction Rule
* Delimiter-based rule
  * WIEN, Stalker
* Content-based rule
* Context-based rule
  * Rapier, AutoSlog, SRV, IEPAD
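
To make the idea of a delimiter-based rule concrete, the sketch below applies one left/right delimiter pair per field, repeatedly, to pull k-tuples out of a page. It is a simplified illustration in the spirit of left-right delimiter wrappers, not the actual implementation of WIEN or Stalker; the function name `exec_lr`, the sample page, and the delimiters are assumptions.

```python
def exec_lr(page, delimiters):
    """Extract records by scanning for each field's left and right delimiter in turn."""
    records, pos = [], 0
    # Keep going while the first field's left delimiter still occurs ahead.
    while page.find(delimiters[0][0], pos) != -1:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records  # incomplete record: stop
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))
    return records


# Invented page: each record is a bold name followed by an italic code.
page = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
wrapper = [("<b>", "</b>"), ("<i>", "</i>")]
records = exec_lr(page, wrapper)
print(records)
```

The rule never inspects the field content itself, only the text around it, which is what distinguishes a delimiter-based rule from a content-based one.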

Background Knowledge
* For rule generalization
* Implicit or explicit
* Examples
  * Specified format for date, email, etc.
  * Special feature for color, location, etc.
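
One way to encode explicit background knowledge such as "specified format for date, email" is a table of format patterns that a rule learner could use as features. This is a minimal sketch under stated assumptions: the patterns are deliberately simplified (ISO-style dates only, a loose email shape), and the names `FORMATS` and `tag_formats` are hypothetical.

```python
import re

# Hypothetical format knowledge: simplified patterns, not complete validators.
FORMATS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def tag_formats(text):
    """Return (type, matched string) pairs for every known format found in text."""
    hits = []
    for kind, pattern in FORMATS.items():
        hits.extend((kind, m.group()) for m in pattern.finditer(text))
    return hits


hits = tag_formats("Apply to jobs@example.com by 2003-05-01.")
print(hits)
```

A learner that knows a token "looks like a date" can generalize a rule from one training example to any other date, instead of memorizing the literal string.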

Conclusion
* Define the IE problem
* Specify the input: training examples with or without annotation
* Depict the extraction rule
* Use necessary background knowledge

References
* H. Cunningham, Information Extraction – a User Guide, http://www.dcs.shef.ac.uk
* MUC-6, http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
* I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, The AAAI-99 Workshop on Machine Learning for Information Extraction.
* Califf, Relational Learning of Pattern-Match Rules for Information Extraction, AAAI-99.