Information Extraction Yunyao Li EECS/SI 767 03/29/2006


The Problem
Target fields: Date, Time (Start - End), Speaker (Person), Location

What is "Information Extraction"
Filling slots in a database from sub-segments of text. As a task:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
Empty target table: NAME / TITLE / ORGANIZATION
Courtesy of William W. Cohen

What is "Information Extraction"
Filling slots in a database from sub-segments of text. As a task:
(same news excerpt as above) -> IE ->
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Courtesy of William W. Cohen

What is "Information Extraction"
Information Extraction = segmentation + classification + association + clustering
(same news excerpt as above)
Segments extracted (aka "named entity extraction"): Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
Courtesy of William W. Cohen


What is "Information Extraction"
Information Extraction = segmentation + classification + association + clustering
(same news excerpt as above)
Segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
Resulting table:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Courtesy of William W. Cohen

Live Example: Seminar

Landscape of IE Techniques (example sentence: "Abraham Lincoln was born in Kentucky.")
– Lexicons: look up each token in a lexicon (e.g. a list of US states: Alabama, Alaska, …, Wisconsin, Wyoming); member?
– Classify pre-segmented candidates: a classifier assigns a class to each candidate segment
– Sliding window: a classifier scores each window of tokens, trying alternate window sizes
– Boundary models: classifiers detect BEGIN and END boundaries
– Context-free grammars: find the most likely parse
– Finite state machines: find the most likely state sequence (our focus today!)
Courtesy of William W. Cohen

Markov Property
(Figure: a small state-transition diagram over S1: rain, S2: cloud, S3: sun.)
The state of the system at time t+1, q_{t+1}, is conditionally independent of {q_{t-1}, q_{t-2}, …, q_1, q_0} given q_t. In other words, the current state determines the probability distribution of the next state.

Markov Property
State-transition probabilities A = {a_ij}, where a_ij = P(q_{t+1} = S_j | q_t = S_i) (matrix shown in the slide figure); S1: rain, S2: cloud, S3: sun.
Q: given that today is sunny (i.e., q_1 = S3), what is the probability of the sequence "sun, cloud" under the model?
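To make the question concrete, here is a minimal Python sketch. The transition matrix below is an assumption for illustration (the actual numbers appear only in the slide's figure); the point is that, by the Markov property, the answer is just one entry of A.

```python
import numpy as np

# States of the toy weather chain from the slide.
states = ["rain", "cloud", "sun"]

# Hypothetical state-transition matrix A, where A[i][j] = P(q_{t+1}=j | q_t=i).
# The actual numbers live in the slide's figure; these rows are assumptions
# chosen only so that each row sums to 1.
A = np.array([
    [0.5, 0.5, 0.0],   # from rain
    [1/3, 1/3, 1/3],   # from cloud
    [0.0, 2/3, 1/3],   # from sun
])

# Q: given today is sunny (q_1 = sun), the probability of "sun, cloud",
# i.e. that tomorrow is cloudy, is a single entry of A.
sun, cloud = states.index("sun"), states.index("cloud")
print(A[sun, cloud])   # P(cloud tomorrow | sun today)
```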

Hidden Markov Model
(Figure: the weather chain again, S1: rain, S2: cloud, S3: sun, now with emission probabilities such as 4/5, 1/10, 7/10, 1/5, 3/10, 9/10 attached to the states; the states are hidden, and we only see the observation sequence O_1 … O_5 generated along an underlying state sequence.)

IE with Hidden Markov Model
Given a sequence of observations: "SI/EECS 767 is held weekly at SIN2."
and a trained HMM with states {course name, location name, background},
find the most likely state sequence (Viterbi).
Any words said to be generated by the designated "course name" state are extracted as a course name:
Course name: SI/EECS 767
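A minimal sketch of Viterbi decoding for this kind of IE HMM. The state set matches the slide, but the start, transition, and emission probabilities below are made-up illustrative values, not a trained model.

```python
import numpy as np

# Minimal Viterbi decoder for a toy IE HMM. All probabilities are illustrative
# assumptions, not the numbers used in the lecture.
states = ["course_name", "location_name", "background"]

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observation list."""
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 1e-6) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s].get(obs[t], 1e-6), p)
                for p in states)
            V[t][s], back[t][s] = prob, prev
    # Trace back the best path from the best final state.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

obs = "SI/EECS 767 is held weekly at SIN2 .".split()
start_p = {"course_name": 0.4, "location_name": 0.1, "background": 0.5}
trans_p = {s: {t: 1/3 for t in states} for s in states}  # uniform transitions
emit_p = {
    "course_name": {"SI/EECS": 0.5, "767": 0.5},
    "location_name": {"SIN2": 0.9},
    "background": {w: 0.15 for w in ["is", "held", "weekly", "at", "."]},
}
print(viterbi(obs, start_p, trans_p, emit_p))
# Words decoded as "course_name" (here "SI/EECS 767") are extracted as the course name.
```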

Named Entity Extraction [Bikel et al., 1998]
Hidden states: Person, Org, Other, plus five other name classes, with special start-of-sentence and end-of-sentence states.

Named Entity Extraction
Transition probabilities: P(s_t | s_{t-1}, o_{t-1})
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1})
Three kinds of generation: (1) generating the first word of a name-class, (2) generating the rest of the words in the name-class, (3) generating "+end+" in a name-class.

Training: Estimating Probabilities

Back-Off
To handle "unknown words" and insufficient training data, back off to simpler models:
Transition probabilities: P(s_t | s_{t-1}), then P(s_t)
Observation probabilities: P(o_t | s_t), then P(o_t)
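A minimal sketch of what backing off the observation model can look like, assuming a simple linear interpolation between P(o | s), P(o), and a small floor for unknown words. The counts and interpolation weights are made up; this is not Bikel et al.'s exact smoothing scheme.

```python
# Sketch of a backed-off observation model with illustrative weights.
from collections import Counter

bigram = Counter()       # counts of (state, word)
unigram = Counter()      # counts of word
state_count = Counter()  # counts of state
total_words = 0

def observe(state, word):
    global total_words
    bigram[(state, word)] += 1
    unigram[word] += 1
    state_count[state] += 1
    total_words += 1

def p_obs(word, state, lam1=0.6, lam2=0.3, floor=1e-6):
    """Interpolate P(o|s) with P(o) and a small floor for unknown words."""
    p_word_given_state = bigram[(state, word)] / state_count[state] if state_count[state] else 0.0
    p_word = unigram[word] / total_words if total_words else 0.0
    return lam1 * p_word_given_state + lam2 * p_word + (1 - lam1 - lam2) * floor

for w in ["John", "Smith"]:
    observe("Person", w)
observe("Other", "said")
print(p_obs("John", "Person"), p_obs("Gates", "Person"))  # the unseen word only gets the floor mass
```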

HMM Experimental Results
Trained on ~500k words of newswire text. Results: (table shown in the slide figure)

Learning HMM for IE [Seymore, 1999] Consider labeled, unlabeled, and distantly-labeled data

Some Issues with HMMs
– Need to enumerate all possible observation sequences
– Not practical to represent multiple interacting features or long-range dependencies of the observations
– Very strict independence assumptions on the observations

Maximum Entropy Markov Models
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations.
Example observation features: identity of word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in a hyperlink anchor, …
For example, the word "Wisniewski" triggers: part of noun phrase, ends in "-ski".
Courtesy of William W. Cohen [Lafferty, 2001]

MEMM
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state history.
(Same observation features as above.)
Courtesy of William W. Cohen
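A minimal sketch of the MEMM's local model: a softmax (maxent) classifier over next states, conditioned on the previous state and the current observation. The feature functions and weights below are hypothetical.

```python
import numpy as np

# Toy maxent next-state model P(s_t | s_{t-1}, o_t); features and weights are made up.
states = ["person", "other"]

def features(prev_state, word):
    return np.array([
        1.0,                                # bias
        1.0 if word[0].isupper() else 0.0,  # is capitalized
        1.0 if word.endswith("ski") else 0.0,
        1.0 if prev_state == "person" else 0.0,
    ])

# One weight vector per next state (hypothetical values).
W = {"person": np.array([-1.0, 2.0, 3.0, 1.0]),
     "other":  np.array([ 1.0, 0.0, 0.0, 0.0])}

def next_state_dist(prev_state, word):
    scores = np.array([W[s] @ features(prev_state, word) for s in states])
    exp = np.exp(scores - scores.max())   # per-state (local) normalization
    return dict(zip(states, exp / exp.sum()))

print(next_state_dist("other", "Wisniewski"))  # capitalized, ends in "-ski": mostly "person"
print(next_state_dist("other", "table"))       # mostly "other"
```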

HMM vs. MEMM
(Figure: the two graphical models side by side over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}. In the HMM, each hidden state S_t generates its observation O_t; in the MEMM, the arrows are reversed, so S_t is conditioned on S_{t-1} and O_t.)

Label Bias Problem with MEMM
Consider this MEMM (figure in the slide), where state 1 has a single outgoing transition, to state 2:
Pr(1→2 | ro) = Pr(2 | 1, ro) Pr(1 | ro) = Pr(2 | 1, o) Pr(1 | r)
Pr(1→2 | ri) = Pr(2 | 1, ri) Pr(1 | ri) = Pr(2 | 1, i) Pr(1 | r)
Because of per-state normalization, Pr(2 | 1, o) = Pr(2 | 1, i) = 1.
So Pr(1→2 | ro) = Pr(1→2 | ri). But it should be Pr(1→2 | ro) < Pr(1→2 | ri)!
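The effect is easy to see numerically: with per-state normalization, a state with a single outgoing transition assigns that transition probability 1 no matter what score the observation gets. A tiny sketch (the scores are arbitrary):

```python
import numpy as np

# Toy MEMM state with a single outgoing transition. The raw score depends on
# the observation, but local (per-state) normalization forces the distribution
# over successors to sum to 1, so the observation cannot change anything.
def next_state_dist(scores):
    exp = np.exp(scores)
    return exp / exp.sum()

print(next_state_dist(np.array([3.0])))   # observation "o": [1.]
print(next_state_dist(np.array([-5.0])))  # observation "i": [1.]
```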

Solving the Label Bias Problem
– Change the state-transition structure of the model
  – Not always practical to change the set of states
– Start with a fully-connected model and let the training procedure figure out a good structure
  – Precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction)

Random Field Courtesy of Rongkun Shen

Conditional Random Field Courtesy of Rongkun Shen

Conditional Distribution
If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, is, by the fundamental theorem of random fields:
p_θ(y | x) ∝ exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} μ_k g_k(v, y|_v, x) )
where
– x is a data sequence and y a label sequence
– v is a vertex from the vertex set V = set of label random variables
– e is an edge from the edge set E over V
– f_k and g_k are given and fixed: g_k is a Boolean vertex feature, f_k is a Boolean edge feature
– k ranges over the features
– θ = (λ_1, λ_2, …; μ_1, μ_2, …) are the parameters to be estimated
– y|_e is the set of components of y defined by edge e, and y|_v the set of components of y defined by vertex v

Conditional Distribution (cont.)
CRFs use the observation-dependent normalization Z(x) for the conditional distributions:
p_θ(y | x) = (1 / Z(x)) exp( Σ_{e, k} λ_k f_k(e, y|_e, x) + Σ_{v, k} μ_k g_k(v, y|_v, x) )
Z(x) normalizes over all label sequences for the given data sequence x.

HMM-like CRF
Use a single feature for each state-state pair (y', y) and each state-observation pair (y, x) in the data to train the CRF:
f_{y',y}(<u,v>, y|_{<u,v>}, x) = 1 if y_u = y' and y_v = y, and 0 otherwise
g_{y,x}(v, y|_v, x) = 1 if y_v = y and x_v = x, and 0 otherwise
The weights λ_{y',y} and μ_{y,x} are then equivalent to the logarithms of the HMM transition probability Pr(y' → y) and observation probability Pr(x | y).
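A quick numeric check of this equivalence, under assumed toy transition and emission tables: setting λ_{y',y} = log Pr(y | y') and μ_{y,x} = log Pr(x | y) makes the CRF's unnormalized log-score of a labeled path equal the HMM's log-probability (the initial-state term is ignored here).

```python
import numpy as np

# Toy transition/emission tables; the numbers are illustrative assumptions.
states = ["A", "B"]
trans = {("A", "A"): 0.7, ("A", "B"): 0.3, ("B", "A"): 0.4, ("B", "B"): 0.6}
emit = {("A", "x"): 0.9, ("A", "y"): 0.1, ("B", "x"): 0.2, ("B", "y"): 0.8}

# CRF weights: lambda_{y',y} = log Pr(y | y'), mu_{y,x} = log Pr(x | y).
lam = {k: np.log(v) for k, v in trans.items()}
mu = {k: np.log(v) for k, v in emit.items()}

def crf_score(labels, obs):
    """Unnormalized log-score: sum of the edge and vertex feature weights that fire."""
    s = sum(lam[(labels[i - 1], labels[i])] for i in range(1, len(labels)))
    s += sum(mu[(labels[i], obs[i])] for i in range(len(labels)))
    return s

def hmm_log_prob(labels, obs):
    """Log of the HMM transition/emission product (initial-state term ignored)."""
    p = sum(np.log(trans[(labels[i - 1], labels[i])]) for i in range(1, len(labels)))
    p += sum(np.log(emit[(labels[i], obs[i])]) for i in range(len(labels)))
    return p

labels, obs = ["A", "B", "B"], ["x", "y", "y"]
print(crf_score(labels, obs), hmm_log_prob(labels, obs))  # identical values
```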

HMM-like CRF
For a chain structure, the conditional probability of a label sequence can be expressed in matrix form. For each position i in the observed sequence x, define the matrix
M_i(y', y | x) = exp( Σ_k λ_k f_k(e_i, (y', y), x) + Σ_k μ_k g_k(v_i, y, x) )
where e_i is the edge with labels (y_{i-1}, y_i) and v_i is the vertex with label y_i.

HMM-like CRF
The normalization function Z(x) is the (start, stop) entry of the product of these matrices:
Z(x) = [ M_1(x) M_2(x) ⋯ M_{n+1}(x) ]_{start, stop}
The conditional probability of a label sequence y is then
p(y | x) = ( Π_{i=1}^{n+1} M_i(y_{i-1}, y_i | x) ) / Z(x)
where y_0 = start and y_{n+1} = stop.
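A minimal numeric sketch of this matrix form. The label set, the single toy vertex feature, and the randomly drawn edge weights are assumptions for illustration; the code simply evaluates M_i, Z(x), and p(y | x) as defined above, and checks that the probabilities of all label sequences sum to 1.

```python
import numpy as np

# Illustrative linear-chain CRF in matrix form.
labels = ["start", "A", "B", "stop"]
L = len(labels)
rng = np.random.default_rng(0)
weights = rng.normal(size=(L, L))            # one weight per (y', y) edge feature
weights[:, labels.index("start")] = -np.inf  # forbid transitions into "start"
weights[labels.index("stop"), :] = -np.inf   # forbid transitions out of "stop"

def M(i, x):
    """Matrix M_i(y', y | x) = exp(edge score + vertex score at position i)."""
    # Toy vertex feature g: fires when the current word is all digits and the
    # current label is "A"; its weight (1.5) is made up.
    if 0 < i <= len(x) and x[i - 1].isdigit():
        vertex = np.where(np.array(labels) == "A", 1.5, 0.0)
    else:
        vertex = np.zeros(L)
    return np.exp(weights + vertex)          # vertex score added per current label (column)

def Z(x):
    """Normalization: the (start, stop) entry of M_1(x) M_2(x) ... M_{n+1}(x)."""
    prod = np.eye(L)
    for i in range(1, len(x) + 2):
        prod = prod @ M(i, x)
    return prod[labels.index("start"), labels.index("stop")]

def p(y, x):
    """p(y | x) for a label sequence y, padded with y_0 = start and y_{n+1} = stop."""
    seq = ["start"] + y + ["stop"]
    score = 1.0
    for i in range(1, len(seq)):
        score *= M(i, x)[labels.index(seq[i - 1]), labels.index(seq[i])]
    return score / Z(x)

x = ["SI/EECS", "767"]
total = p(["A", "A"], x) + p(["A", "B"], x) + p(["B", "A"], x) + p(["B", "B"], x)
print(total)  # sums to 1.0
```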

Parameter Estimation
The problem: determine the parameters θ = (λ_1, λ_2, …; μ_1, μ_2, …) from training data with empirical distribution p̃(x, y).
The goal: maximize the log-likelihood objective function
O(θ) = Σ_{x, y} p̃(x, y) log p_θ(y | x)

Parameter Estimation – Iterative Scaling Algorithms
Update the weights as λ_k ← λ_k + δλ_k and μ_k ← μ_k + δμ_k.
The appropriately chosen δλ_k for an edge feature f_k is the solution of
Ẽ[f_k] = Σ_{x, y} p̃(x) p(y | x) Σ_i f_k(e_i, y|_{e_i}, x) exp( δλ_k T(x, y) )
T(x, y) is a global property of (x, y), and efficiently computing the right-hand side of this equation is a problem.

Algorithm S
Define the slack feature:
s(x, y) = S − Σ_i Σ_k f_k(e_i, y|_{e_i}, x) − Σ_i Σ_k g_k(v_i, y|_{v_i}, x)
where S is a constant large enough that s(x, y) ≥ 0 for all (x, y).
For each index i = 0, …, n+1 we define forward vectors
α_0(y | x) = 1 if y = start, 0 otherwise;  α_i(x) = α_{i-1}(x) M_i(x)
and backward vectors
β_{n+1}(y | x) = 1 if y = stop, 0 otherwise;  β_i(x)^T = M_{i+1}(x) β_{i+1}(x)

Algorithm S (cont.)
With the slack feature, the updates have the closed form
δλ_k = (1/S) log( Ẽ[f_k] / E[f_k] ),  δμ_k = (1/S) log( Ẽ[g_k] / E[g_k] )
where the model expectations E[·] are computed from the forward and backward vectors.

The rate of convergence is governed by the step size, which is inversely proportional to the constant S; but S is generally quite large, resulting in slow convergence.

Algorithm T
Keeps track of partial T totals: it accumulates feature expectations into counters indexed by T(x).
Forward-backward recurrences are used to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k given that T(x) = t.

Experiments
Modeling the label bias problem:
– 2,000 training and 500 test samples generated by an HMM
– CRF error is 4.6%
– MEMM error is 42%
The CRF solves the label bias problem.

Experiments
Modeling mixed-order sources:
– The CRF converges in 500 iterations
– The MEMM converges in 100 iterations

MEMM vs. HMM The HMM outperforms the MEMM

CRF vs. MEMM CRF usually outperforms the MEMM

CRF vs. HMM
Each open square represents a data set with α < 1/2, and a solid square indicates a data set with α ≥ 1/2. When the data is mostly second order (α ≥ 1/2), the discriminatively trained CRF usually outperforms the HMM.

POS Tagging Experiments
First-order HMM, MEMM, and CRF models.
Data set: Penn Treebank, with a 50%-50% train-test split.
The MEMM parameter vector is used as a starting point for training the corresponding CRF, to accelerate convergence.

Interactive IE using CRF
An interactive parser updates the IE results according to the user's changes. Color coding is used to alert the user to ambiguity in the IE results.

Some Available IE Tools
MALLET (UMass) – MAchine Learning for LanguagE Toolkit:
– statistical natural language processing
– document classification
– clustering
– information extraction
– other machine learning applications to text
Sample application: GeneTaggerCRF, a gene-entity tagger based on MALLET; it uses conditional random fields to find genes in a text file.

MinorThird
"a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text"
Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information).

GATE
A leading toolkit for text mining, distributed with an information extraction component set called ANNIE (demo available online).
Used in many research projects – a long list can be found on its website.
Integration with IBM UIMA is underway.

Sunita Sarawagi's CRF package A Java implementation of conditional random fields for sequential labeling.

UIMA (IBM)
Unstructured Information Management Architecture: a platform for unstructured information management solutions built from combinations of semantic analysis (IE) and search components.

Some Interesting Websites Based on IE
ZoomInfo, CiteSeer.org (some of us use it every day!), Google Local, Google Scholar, and many more…