Introduction to Information Extraction Chia-Hui Chang Dept. of Computer Science and Information Engineering, National Central University, Taiwan



2 Problem Definition
Information Extraction (IE) identifies relevant information in documents, pulling it from a variety of sources and aggregating it into a homogeneous form.
The output template of the IE task:
* Several fields (slots)
* Several instances of a field

3 Difficulty of IE tasks depends on …
* Text type
  * From Wall Street Journal articles, or messages, to HTML documents.
* Domain
  * From financial news, or tourist information, to various languages.
* Scenario

4 Various IE Tasks
* Free-text IE:
  * For MUC (Message Understanding Conference)
  * E.g. terrorist activities, corporate joint ventures
* Semi-structured IE:
  * E.g. meta-search engines, shopping agents, bio-integration systems

5 Types of IE from MUC
* Named Entity recognition (NE)
  * Finds and classifies names, places, etc.
* Coreference Resolution (CO)
  * Identifies identity relations between entities in texts.
* Template Element construction (TE)
  * Adds descriptive information to NE results.
* Scenario Template production (ST)
  * Fits TE results into specified event scenarios.

6 Named Entity Recognition _3.html

7 NE Recognition (Cont.) Spanish: 93% Japanese: 92% Chinese: 84.51%

8 Coreference Resolution
Coreference resolution (CO) involves identifying identity relations between entities in texts.
For example, in "Alas, poor Yorick, I knew him well," tie "Yorick" with "him".
The Sheffield system scored 51% recall and 71% precision.
k_4.html

9 Template Element Production
* Adds descriptive information to named entities
* The Sheffield system scored 71%

10 Scenario Template Extraction
* STs are the prototypical outputs of IE systems.
* They tie together TE entities into event and relation descriptions.
* Performance for Sheffield: 49%
faculty/grishman/ IEtask15.book_2.html

11 Example
The operational domains that user interests center on are drug enforcement, money laundering, organized crime, terrorism, ….
1. Input: texts dealing with drug enforcement, money laundering, organized crime, terrorism, and legislation;
2. NE: recognizes entities in those texts and assigns them to one of a number of categories drawn from the set of entities of interest (person, company, ...);
3. TE: associates certain types of descriptive information with these entities, e.g. the location of companies;
4. ST: identifies a set (relatively small to begin with) of events of interest by tying entities together into event relations.
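The four steps above can be pictured as successive passes over a text. What follows is only a toy sketch under stated assumptions: the gazetteer, the "based in" pattern, the "investigation" event, the sample sentence and all function names are hypothetical and are not taken from any MUC system.

```python
import re

# Hypothetical gazetteer mapping known names to entity categories (an assumption).
GAZETTEER = {"Acme Corp": "company", "John Doe": "person"}

def named_entities(text):
    """NE: find known names in the text and assign each a category."""
    return [{"name": n, "type": t} for n, t in GAZETTEER.items() if n in text]

def template_elements(text, entities):
    """TE: attach descriptive information, here the location of companies."""
    elements = []
    for ent in entities:
        elem = dict(ent)
        if ent["type"] == "company":
            m = re.search(re.escape(ent["name"]) + r", based in ([A-Z][a-z]+)", text)
            if m:
                elem["location"] = m.group(1)
        elements.append(elem)
    return elements

def scenario_templates(text, elements):
    """ST: tie entities together into an event relation (a toy 'investigation' event)."""
    people = [e for e in elements if e["type"] == "person"]
    firms = [e for e in elements if e["type"] == "company"]
    if "investigat" in text and people and firms:
        return [{"event": "investigation",
                 "suspect": people[0]["name"],
                 "organization": firms[0]["name"]}]
    return []

# Made-up input sentence, for illustration only.
text = "John Doe of Acme Corp, based in Zurich, is being investigated for money laundering."
ents = named_entities(text)
elems = template_elements(text, ents)
events = scenario_templates(text, elems)
```

Each stage consumes the output of the previous one, which is why errors made at the NE stage propagate into TE and ST results.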

12 Example Text

13 Output Example (NE, TE)

14 Output (STs)

15 Another IE Example
Corporate Management Changes
* Purpose
  * Which positions in which organizations are changing hands?
  * Who is leaving a position and where is the person going to?
  * Who is appointed to a position and where is the person coming from?
  * The locations and types of the organizations involved in the succession events
  * The names and titles of the persons involved in the succession events

16 Input Text
President Clinton nominated John Rollwagen, the chairman and CEO of Cray Research Inc., as the No. 2 Commerce Department official. Mr. Rollwagen said he wants to push the Clinton administration to aggressively confront U.S. trading partners such as Japan to open their markets, particularly for high-tech industries. In a letter sent throughout the Eagan, Minn.-based company on Friday, Mr. Rollwagen warned: "Whether we like it or not, our country is in an economic war; and we are at a key turning point in that war." Cray said it has appointed John F. Carlson, its president and chief operating officer, to succeed him.

17 Extraction Result

Corporate Management Database
Person | Organization | Position | Transition
John Rollwagen | Cray Research Inc. | chairman | out
John Rollwagen | Cray Research Inc. | CEO | out
John F. Carlson | Cray Research Inc. | chairman | in
John F. Carlson | Cray Research Inc. | CEO | in

Organization Database
Name | Location | Alias | Type
Cray Research Inc. | Eagan, Minn. | Cray | COMPANY
Commerce Department | | | GOVERNMENT
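One way to hold the Corporate Management Database rows above is as one record per filled template instance. This is a minimal sketch: the `Succession` class and the `people_by_transition` helper are names introduced here for illustration, and only the row values come from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Succession:
    """One row of the Corporate Management Database (hypothetical type name)."""
    person: str
    organization: str
    position: str
    transition: str  # "in" or "out"

# Rows copied from the extraction result on the slide.
RECORDS = [
    Succession("John Rollwagen", "Cray Research Inc.", "chairman", "out"),
    Succession("John Rollwagen", "Cray Research Inc.", "CEO", "out"),
    Succession("John F. Carlson", "Cray Research Inc.", "chairman", "in"),
    Succession("John F. Carlson", "Cray Research Inc.", "CEO", "in"),
]

def people_by_transition(records, transition):
    """Return the distinct people with the given transition, in first-seen order."""
    seen = []
    for r in records:
        if r.transition == transition and r.person not in seen:
            seen.append(r.person)
    return seen

leaving = people_by_transition(RECORDS, "out")   # who is leaving a position
arriving = people_by_transition(RECORDS, "in")   # who is appointed to a position
```

The same slot (Position) appears in several records per person, which is the "several instances of a field" case from the problem definition.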

18 MUC Data Set for
* MET2: uc/met2/met2package.tar.gz
* MUC3&4: uc/muc_data/muc34.tar.gz
* MUC6&7: from LDC
  * MUC-6
  * MUC-7: proceedings/muc_7_toc.html

19 Summary
* Evaluation
  * Precision = (# of correctly extracted fields) / (# of extracted fields)
  * Recall = (# of correctly extracted fields) / (# of fields to be extracted)
* Design Methodology
  * Natural Language Processing
  * Machine Learning
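The two evaluation measures can be computed directly from the set of extracted items and the answer key. The sketch below assumes each field is represented as a hashable (slot, value) pair; the function name, the sample data and the choice to return 0.0 for an empty denominator are assumptions made here, not part of the slides.

```python
def precision_recall(extracted, expected):
    """Precision and recall of extracted items against the answer key.

    Items must be hashable, e.g. (slot, value) pairs for fields
    or whole tuples for records.
    """
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    # Convention assumed here: an empty denominator yields 0.0.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall

# Made-up answer key and system output, for illustration only.
expected = {
    ("person", "John Rollwagen"),
    ("organization", "Cray Research Inc."),
    ("position", "chairman"),
    ("position", "CEO"),
}
extracted = {
    ("person", "John Rollwagen"),
    ("organization", "Cray Research Inc."),
    ("position", "president"),
}
p, r = precision_recall(extracted, expected)
```

Here two of the three extracted fields are correct and two of the four expected fields are found, so precision is 2/3 and recall is 1/2.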

20 IE from Semi-structured Documents
* Output Template: k-tuple
  * Multiple instances of a field
  * Missing data

21 Various IE Tasks for Semi-structured Documents
* Multiple-record page extraction
* One-record (singular) page extraction

22 Multiple-record page extraction

23 One-record (singular) page extraction

24 Summary
* Evaluation
  * Precision = (# of correctly extracted records) / (# of extracted records)
  * Recall = (# of correctly extracted records) / (# of records to be extracted)
* Design Methodology
  * Machine Learning
  * Pattern Mining

25 News Group IE Example: Computer-Related Jobs

26 Output Template Between free-text IE and semi-structured IE [CaliffRapier 99]

27 Annotated Training Examples
* Most systems require annotated training examples (answer keys)
  * AutoSlog, Rapier, SRV, WIEN, SoftMealy, Stalker
* Very few systems need only unannotated training examples
  * AutoSlog-TS, IEPAD, OLERA

28 The Type of Extraction Rule
* Delimiter-based Rule
  * WIEN, Stalker
* Content-based Rule
* Context-based Rule
  * Rapier, AutoSlog, SRV, IEPAD
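A delimiter-based rule locates a field by the strings that surround it rather than by the field's own content. The following sketch only applies an already-known pair of delimiters; it is not the rule-induction algorithm of WIEN or Stalker, and the page snippet, delimiters and function name are made up for illustration.

```python
def extract_between(page, left, right):
    """Return every substring found between a left and a right delimiter."""
    values, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        pos = end + len(right)
    return values

# Hypothetical multiple-record page: each record has a name and a code.
page = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"
countries = extract_between(page, "<b>", "</b>")
codes = extract_between(page, "<i>", "</i>")
records = list(zip(countries, codes))
```

Pairing the two field lists with `zip` only works when no record has missing data; handling missing or repeated fields is where such simple rules tend to break down.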

29 Background Knowledge
* For Rule Generalization
  * Implicit or Explicit
* Example
  * Specified format for date, etc.
  * Special feature for color, location, etc.
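Explicit background knowledge of the "specified format" kind can be written down as a content pattern. The sketch below assumes a month/day/four-digit-year format; that format, the pattern name and the sample posting are assumptions chosen for illustration and do not come from any of the systems named in these slides.

```python
import re

# Assumed date format: M/D/YYYY or MM/DD/YYYY.
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

def find_dates(text):
    """Return all substrings of the text that match the assumed date format."""
    return DATE.findall(text)

# Made-up job posting text.
posting = "Posted 9/17/1997. Apply before 10/01/1997; salary 50/60K."
dates = find_dates(posting)
```

The pattern accepts both dates and rejects "50/60K", since the latter lacks the third, four-digit component.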

30 Conclusion
* Define the IE problem
* Specify the input: training examples
  * with annotation, or
  * without annotation
* Depict the extraction rule
  * Use necessary background knowledge

31 References
* H. Cunningham, Information Extraction – a User Guide
* MUC-6, grishman/muc6.html
* I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, The AAAI-99 Workshop on Machine Learning for Information Extraction.
* Califf, Relational Learning of Pattern-Match Rules for Information Extraction, AAAI-99.