1
Supervised Learning for NLP: Entity Extraction
Heng Ji September 23, 2016
2
Supervised Learning based IE
'Pipeline' style IE:
- Split the task into several components
- Prepare data annotation for each component
- Apply supervised machine learning methods to address each component separately
- Most state-of-the-art ACE IE systems were developed in this way
- Provides a great opportunity to apply a wide range of learning models and to incorporate diverse levels of linguistic features to improve each component
- Large progress has been achieved on some of these components, such as name tagging and relation extraction
3
Major IE Components Name/Nominal Extraction
Name/Nominal Extraction: "Barry Diller", "chief"
Entity Coreference Resolution: "Barry Diller" = "chief"
Time Identification and Normalization: Wednesday ( )
Relation Extraction: "Vivendi Universal Entertainment" is located in "France"
Event Mention Extraction and Event Coreference Resolution: "Barry Diller" is the person of the end-position event triggered by "quit"
This structure exists on many levels: the structure of names, the grammatical structure of sentences, and the coreference structure across a discourse (and even across multiple discourses). Each of these is important to IE, to figuring out the participants in an event, and each has been studied separately and quite intensively over the past decade. Annotated corpora have been prepared for each of these levels of structure, and a wide range of models and machine learning methods have been applied to construct analyzers (particularly for name and grammatical structure). Except for coreference analysis, the results of these efforts have in general been quite satisfactory, on the order of 90% accuracy for names and for grammatical constituents.
4
Name Tagging: Task
Person (PER): named person or family
Organization (ORG): named corporate, governmental, or other organizational entity
Geo-political entity (GPE): name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)
But also: Location, Artifact, Facility, Vehicle, Weapon, Product, etc.
Extended name hierarchy, 150 types, domain-dependent (Sekine and Nobata, 2004)
Convert it into a sequence labeling problem using "BIO" tagging:
<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
B-PER   I-PER   I-PER   O          B-GPE
George  W.      Bush    discussed  Iraq
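As a concrete illustration of the BIO conversion, here is a minimal Python sketch (not from the slides) that turns inline <PER>/<GPE>-style markup into token-level BIO tags; the whitespace tokenization and the helper name are assumptions for illustration.

```python
import re

def to_bio(annotated_sentence):
    """Convert inline <TYPE>...</TYPE> markup into (token, BIO-tag) pairs."""
    pairs = []
    # Alternate between tagged spans and plain text outside any span.
    for tag, span, plain in re.findall(r"<(\w+)>(.*?)</\1>|([^<]+)", annotated_sentence):
        if tag:  # tokens inside a named-entity span: B- for the first, I- for the rest
            tokens = span.split()
            pairs.append((tokens[0], "B-" + tag))
            pairs.extend((tok, "I-" + tag) for tok in tokens[1:])
        else:    # tokens outside any span get the O tag
            pairs.extend((tok, "O") for tok in plain.split())
    return pairs

print(to_bio("<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>"))
# [('George', 'B-PER'), ('W.', 'I-PER'), ('Bush', 'I-PER'), ('discussed', 'O'), ('Iraq', 'B-GPE')]
```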
5
Supervised Learning for Name Tagging
- Maximum Entropy Models (Borthwick, 1999; Chieu and Ng, 2002; Florian et al., 2007)
- Decision Trees (Sekine et al., 1998)
- Class-based Language Model (Sun et al., 2002; Ratinov and Roth, 2009)
- Agent-based Approach (Ye et al., 2002)
- Support Vector Machines (Takeuchi and Collier, 2002)
Sequence Labeling Models:
- Hidden Markov Models (HMMs) (Bikel et al., 1997; Ji and Grishman, 2005)
- Maximum Entropy Markov Models (MEMMs) (McCallum and Freitag, 2000)
- Conditional Random Fields (CRFs) (McCallum and Li, 2003)
6
Typical Name Tagging Features
- N-gram: unigram, bigram, and trigram token sequences in the context window of the current token
- Part-of-Speech: POS tags of the context words
- Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.
- Word clusters: to reduce sparsity, use word clusters such as Brown clusters (Brown et al., 1992)
- Case and Shape: capitalization and morphology-based features
- Chunking: NP and VP chunking tags
- Global features: sentence-level and document-level features, e.g., whether the token is in the first sentence of a document
- Conjunction: conjunctions of the above features
A per-token feature-extraction sketch is given below.
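The sketch below shows how these feature types might be computed for one token before feeding them to a MaxEnt or CRF tagger; the function name, feature keys, and one-token window are assumptions rather than the lecture's exact feature set.

```python
def token_features(tokens, pos_tags, i, gazetteer=frozenset()):
    """Illustrative per-token features for a sequence model (names are made up)."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),          # case feature
        "word.isupper": word.isupper(),
        "word.shape": "".join("X" if c.isupper() else "x" if c.islower()
                              else "d" if c.isdigit() else c for c in word),
        "pos": pos_tags[i],                       # part-of-speech
        "in.gazetteer": word in gazetteer,        # gazetteer lookup
        "first.sentence.token": i == 0,           # a simple "global" position feature
    }
    # n-gram context window (previous/next token)
    if i > 0:
        feats["prev.word"] = tokens[i - 1].lower()
        feats["prev.pos"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True
    if i < len(tokens) - 1:
        feats["next.word"] = tokens[i + 1].lower()
        feats["next.pos"] = pos_tags[i + 1]
    else:
        feats["EOS"] = True
    # conjunction feature: combine word shape with POS
    feats["shape+pos"] = feats["word.shape"] + "|" + feats["pos"]
    return feats
```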
7
Markov Chain for a Simple Name Tagger
(Figure: a toy Markov chain name tagger with states START, PER, LOC, X, and END. Arcs between states carry transition probabilities such as 0.6, 0.3, 0.2, 0.1, and each state carries emission probabilities, e.g. PER emits George:0.3, W.:0.3, Bush:0.3, Iraq:0.1; LOC emits Iraq:0.8, George:0.2; X emits discussed:0.7; END emits $:1.0.)
Speaker note: generative models capture the center of a distribution, while discriminative models focus on the margin between distributions; an HMM matches the sequential character of the task.
8
Viterbi Decoding of Name Tagger
(Figure: Viterbi trellis for "George W. Bush discussed Iraq $" over time steps t=0..6 and states START, PER, LOC, X, END, with the best path score at each cell, e.g. 1 * 0.3 * 0.3 = 0.09 for PER at t=1.)
Current = Previous * Transition * Emission
Speaker note: since we assign a tag to each word, many other sequence labeling methods can address this task; one nice property of the HMM is that it matches the character of the task, operating over one sentence at a time.
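A minimal Viterbi decoder in Python, assuming dictionaries of transition and emission probabilities; the toy numbers on the slide are only partially recoverable, so no particular values are hard-coded here.

```python
def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most probable tag sequence under an HMM (toy implementation)."""
    # best[t][s] = (score of the best path ending in state s at time t, backpointer)
    best = [{s: (start_p.get(s, 0.0) * emit_p[s].get(tokens[0], 0.0), None)
             for s in states}]
    for t in range(1, len(tokens)):
        col = {}
        for s in states:
            # current = previous * transition * emission
            prev_s, score = max(
                ((p, best[t - 1][p][0] * trans_p[p].get(s, 0.0)) for p in states),
                key=lambda x: x[1])
            col[s] = (score * emit_p[s].get(tokens[t], 0.0), prev_s)
        best.append(col)
    # trace the best final state back through the stored backpointers
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    return list(reversed(path))
```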
9
Limitations of HMMs Joint probability distribution p(y, x)
- Assumes independent features; cannot represent overlapping features or long-range dependencies between observed elements
- Needs to enumerate all possible observation sequences
- Very strict independence assumptions on the observations
Toward discriminative/conditional models:
- Model the conditional probability P(label sequence y | observation sequence x) rather than the joint probability P(y, x)
- Allow arbitrary, non-independent features on the observation sequence x
- The probability of a transition between labels may depend on past and future observations
- Relax the strong independence assumptions made by generative models
10
Maximum Entropy Why maximum entropy?
Maximize entropy = minimize commitment.
Model all that is known and assume nothing about what is unknown:
- Model all that is known: satisfy a set of constraints that must hold
- Assume nothing about what is unknown: choose the most "uniform" distribution, i.e. the one with maximum entropy
11
Why Try to be Uniform? Most Uniform = Maximum Entropy
By making the distribution as uniform as possible, we do not make any assumptions beyond what is supported by the data.
This abides by the principle of Occam's Razor (fewest assumptions = simplest explanation).
Fewer generalization errors (less over-fitting) and more accurate predictions on test data.
12
Learning Coreference by Maximum Entropy Model
Suppose that if the feature "Capitalization" = "Yes" for token t, then P(t is the beginning of a name | Capitalization = Yes) = 0.7.
How do we adjust the rest of the distribution? P(t is not the beginning of a name | Capitalization = Yes) = 0.3.
What if we never observe "Has Title = Yes" samples? Then assume the most uniform distribution:
P(t is the beginning of a name | Has Title = Yes) = 0.5
P(t is not the beginning of a name | Has Title = Yes) = 0.5
13
The basic idea Goal: estimate p
Choose p with maximum entropy (or “uncertainty”) subject to the constraints (or “evidence”).
14
Setting From training data, collect (a, b) pairs:
a: the thing to be predicted (e.g., a class in a classification problem)
b: the context
Example, name tagging: a = person, b = the words in a window and the previous two tags
Learn the probability of each (a, b) pair: p(a, b)
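Written out (this follows the standard maximum entropy formulation, e.g. Berger et al., 1996, rather than a formula reproduced on the slide), the model chooses

\[ p^{*} = \arg\max_{p} H(p), \qquad H(p) = -\sum_{a,b} \tilde{p}(b)\, p(a \mid b)\, \log p(a \mid b), \]

subject to matching the empirical feature expectations,

\[ \sum_{a,b} \tilde{p}(b)\, p(a \mid b)\, f_i(a,b) \;=\; \sum_{a,b} \tilde{p}(a,b)\, f_i(a,b) \quad \text{for every feature } f_i, \]

and the solution has the familiar log-linear form

\[ p(a \mid b) = \frac{1}{Z(b)} \exp\Big( \sum_i \lambda_i f_i(a,b) \Big). \]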
15
Ex1: Coin-flip example (Klein & Manning 2003)
Toss a coin: p(H) = p1, p(T) = p2.
Constraint: p1 + p2 = 1
Question: what is your estimate of p = (p1, p2)?
Answer: choose the p that maximizes H(p).
(Figure: entropy H as a function of p1; the unconstrained maximum is at p1 = 0.5, and an additional constraint such as p1 = 0.3 picks out a single point on the curve.)
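For a concrete check (the entropy formula is standard, not reproduced on the slide):

\[ H(p) = -p_1 \log_2 p_1 - p_2 \log_2 p_2, \]

which, given only p1 + p2 = 1, is maximized at p1 = p2 = 0.5 with H = 1 bit; if we additionally fix p1 = 0.3, the distribution is fully determined and H ≈ 0.881 bits.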
16
Coin-flip example (cont)
(Figure: the entropy surface over (p1, p2) restricted to the constraint line p1 + p2 = 1; adding the further constraint p1 = 0.3 pins the distribution to a single point.)
17
Ex2: An MT example (Berger et al., 1996)
Possible translations for the word "in" (from Berger et al., 1996): dans, en, à, au cours de, pendant.
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
Intuitive answer: with no other evidence, the most uniform distribution gives each translation probability 1/5.
18
An MT example (cont)
Constraints: p(dans) + p(en) = 3/10, together with p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1.
Intuitive answer: the most uniform distribution satisfying both constraints is p(dans) = p(en) = 3/20 and p(à) = p(au cours de) = p(pendant) = 7/30.
19
Why ME? Advantages Combine multiple knowledge sources
Local:
- Word prefix, suffix, capitalization (POS tagging - Ratnaparkhi, 1996)
- Word, POS, POS class, suffix (WSD - Chao & Dyer, 2002)
- Token prefix, suffix, capitalization, abbreviation (sentence boundary detection - Reynar & Ratnaparkhi, 1997)
Global:
- N-grams (Rosenfeld, 1997)
- Word window
- Document title (Pakhomov, 2002)
- Structurally related words (Chao & Dyer, 2002)
- Sentence length, conventional lexicon (Och & Ney, 2002)
Combine dependent knowledge sources
20
Why ME? Advantages Add additional knowledge sources Implicit smoothing
Disadvantages:
- Computational cost: expected values recomputed at each iteration, normalizing constant
- Overfitting
- Feature selection: frequency cutoffs, basic feature selection (Berger et al., 1996)
21
Maximum Entropy Markov Models (MEMMs)
A conditional model that represents the probability of reaching a state given an observation and the previous state.
Considers observation sequences as events to be conditioned upon.
Has all the advantages of conditional models: no longer assumes that features are independent.
Does not take future observations into account (no forward-backward).
Subject to the label bias problem: biased toward states with fewer outgoing transitions.
22
Conditional Random Fields (CRFs)
Conceptual overview:
- Each attribute of the data fits into a feature function that associates the attribute with a possible label
- A positive value if the attribute appears in the data; zero if it does not
- Each feature function carries a weight that gives the strength of that feature function for the proposed label
- High positive weights: a good association between the feature and the proposed label
- High negative weights: a negative association between the feature and the proposed label
- Weights close to zero: the feature has little or no impact on the identity of the label
CRFs have all the advantages of MEMMs without the label bias problem:
- An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state
- A CRF has a single exponential model for the joint probability of the entire label sequence given the observation sequence, so weights of different features at different states can be traded off against each other
- CRFs provide the benefits of discriminative models
The conditional probability a linear-chain CRF defines is written out below.
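Concretely (standard linear-chain CRF formulation, not a formula reproduced on the slide):

\[ p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big), \qquad Z(x) = \sum_{y'} \exp\Big( \sum_{t,k} \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big). \]

Because the normalizer Z(x) sums over entire label sequences rather than normalizing per state, the label bias of the MEMM disappears.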
23
Example of CRFs
24
Sequential Model Trade-offs
Model | Speed           | Discriminative vs. generative | Normalization
HMM   | very fast       | generative                    | local
MEMM  | mid-range       | discriminative                | local (per state)
CRF   | relatively slow | discriminative                | global
25
Support Vector Machines
First I will talk about the background of my research, which is information extraction from free text, and the limitations of prior approaches. To overcome these limitations, I will propose a new kernel-based model to solve these problems. This model was tested on two tasks: 1), 2). For these two experiments, I will show the experimental results and comparisons with other approaches.
26
Problems in classifying data
- Often high dimensionality of data: hard to put up simple rules
- Large amounts of data: need automated ways to deal with the data
- Use computers for data processing and statistical analysis, and try to learn patterns from the data (machine learning)
27
Black box view of Machine Learning
Training data -> magic black box (learning machine) -> model; test data + model -> prediction (e.g., cancer or not).
Training data: expression patterns from cancer patients plus expression data from healthy people.
Model: can distinguish between healthy and sick people, and can be used for prediction.
28
Tennis example 2
(Figure: scatter plot of Temperature vs. Humidity; one marker class = play tennis, the other = do not play tennis.)
29
Linearly Separable Classes
30
Linear Support Vector Machines
Data: (xi, yi), i = 1, ..., l, where xi ∈ R^d and yi ∈ {-1, +1}
(Figure: points labeled +1 and -1 plotted in the (x1, x2) plane.)
31
Linear SVM 2
Data: (xi, yi), i = 1, ..., l, where xi ∈ R^d and yi ∈ {-1, +1}
All hyperplanes in R^d are parameterized by a vector w and a constant b, and can be expressed as w·x + b = 0 (the equation of a hyperplane from linear algebra).
Our aim is to find such a hyperplane, with decision function f(x) = sign(w·x + b), that correctly classifies our data.
(Figure: the two classes separated by a candidate hyperplane f(x).)
32
Selection of a Good Hyper-Plane
Objective: select a "good" hyper-plane using only the data!
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data
(ii) place the hyper-plane "far" from the data
33
Definitions
Define the hyperplane H such that:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ -1 when yi = -1
H1 and H2 are the planes:
H1: xi·w + b = +1
H2: xi·w + b = -1
The points on the planes H1 and H2 are the support vectors.
d+ = the shortest distance to the closest positive point
d- = the shortest distance to the closest negative point
The margin of a separating hyperplane is d+ + d-.
34
Maximizing the margin
We want a classifier with as big a margin as possible.
Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A² + B²).
The distance between H and H1 is |w·x + b| / ||w|| = 1 / ||w||, so the distance between H1 and H2 is 2 / ||w||.
In order to maximize the margin, we need to minimize ||w||, with the condition that there are no data points between H1 and H2:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ -1 when yi = -1
These can be combined into yi(xi·w + b) ≥ 1.
(Figure: hyperplanes H1, H, H2 with margins d+ and d-.)
35
Optimization Problem
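The formulation on this slide is an image not captured in this text version; for reference, the standard hard-margin primal that the preceding derivation leads to is

\[ \min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, l. \]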
36
Non-Separable Case
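Again the slide body is an image; the usual soft-margin relaxation it illustrates introduces slack variables ξi and a cost parameter C:

\[ \min_{w,\,b,\,\xi} \ \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0. \]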
37
Problems with linear SVM
What if the decision function is not linear?
(Figure: two classes, labeled +1 and -1, that are not linearly separable.)
38
Non-linear SVM 1
The kernel trick: imagine a function Φ that maps the data into another space, Φ: R^d → R^D.
(Figure: classes that are not separable in the original space become linearly separable after mapping with Φ.)
Remember the function we want to optimize: LD = Σi αi - ½ Σi,j αi αj yi yj xi·xj, where xi and xj appear only as a dot product.
In the non-linear case we will have Φ(xi)·Φ(xj) instead. If there is a "kernel function" K such that K(xi, xj) = Φ(xi)·Φ(xj), we never need to know Φ explicitly. One example is given below.
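The specific kernel shown on the slide is not recoverable from this text version; one common example is the Gaussian (RBF) kernel, K(xi, xj) = exp(-||xi - xj||² / (2σ²)). A brief sketch using scikit-learn's SVC (an assumption, not necessarily the toolkit used in the lecture):

```python
from sklearn.svm import SVC

# Toy 2-D data: an XOR-like layout that no linear decision function can separate.
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1]

# An RBF-kernel SVM implicitly maps points into a higher-dimensional space
# via K(xi, xj) = exp(-gamma * ||xi - xj||^2) and finds a separating hyperplane there.
clf = SVC(kernel="rbf", gamma=2.0, C=1.0)
clf.fit(X, y)
print(clf.predict([[0.9, 0.1], [0.1, 0.1]]))   # expected: [1 0]
```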
41
Neural Networks
42
Recall: The Neuron Metaphor
Neurons accept information from multiple inputs and transmit information to other neurons.
- Multiply inputs by weights along edges
- Apply some function to the set of inputs at each node
43
Types of Neurons: Linear Neuron, Logistic Neuron, Perceptron
Potentially more; gradient descent training requires a convex loss function.
44
Multilayer Networks Cascade Neurons together
The output from one layer is the input to the next. Each layer has its own set of weights.
45
Linear Regression Neural Networks
What happens when we arrange linear neurons in a multilayer network?
46
Linear Regression Neural Networks
Nothing special happens: the product of two linear transformations is itself a linear transformation.
\[ f(x,\vec{\theta}) = \sum_{i=0}^{D} \theta_{1,i} \sum_{n=0}^{N-1} \theta_{0,i,n}\, x_n = \sum_{i=0}^{D} \theta_{1,i}\, [\theta_{0,i}^{T}\vec{x}] = \sum_{i=0}^{D} [\hat\theta_{i}^{T}\vec{x}] \]
47
Neural Networks We want to introduce non-linearities to the network.
Non-linearities allow a network to identify complex regions in space
48
Linear Separability 1-layer cannot handle XOR
More layers can handle more complicated regions of space, but require more parameters.
Each node splits the feature space with a hyperplane.
If the second layer computes an AND of the first layer's outputs, a 2-layer network can represent any convex hull.
49
Feed-Forward Networks
Predictions are fed forward through the network to classify
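A minimal numpy sketch of the feed-forward pass, assuming sigmoid units; the layer sizes and random weights are arbitrary placeholders, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, weights, biases):
    """Propagate an input vector through each layer in turn."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # one layer's output becomes the next layer's input
    return a

# Tiny example: 2 inputs -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases  = [np.zeros(3), np.zeros(1)]
print(feed_forward(np.array([0.5, -1.0]), weights, biases))
```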
55
Error Backpropagation
We will do gradient descent on the whole network. Training will proceed from the last layer to the first.
56
Error Backpropagation
Introduce variables over the neural network: \( \vec{\theta} = \{ w_{ij}, w_{jk}, w_{kl} \} \)
57
Error Backpropagation
Introduce variables over the neural network and distinguish the input and output of each node: \( \vec{\theta} = \{ w_{ij}, w_{jk}, w_{kl} \} \)
59
Error Backpropagation
Training: take the gradient of the last component and iterate backwards over \( \vec{\theta} = \{ w_{ij}, w_{jk}, w_{kl} \} \).
60
Error Backpropagation
Empirical Risk Function
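The risk function itself is shown as an image on the slide; a standard form consistent with the surrounding derivation is

\[ R(\vec{\theta}) = \frac{1}{N} \sum_{n=1}^{N} L_n, \qquad \text{e.g. } L_n = \tfrac{1}{2}\big(y_n - f(x_n; \vec{\theta})\big)^2 . \]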
61
Error Backpropagation
Optimize the last-layer weights w_{kl} using the calculus chain rule:
\[ \frac{\partial R}{\partial w_{kl}} = \frac{1}{N} \sum_n \left[ \frac{\partial L_n}{\partial a_{l,n}} \right] \left[ \frac{\partial a_{l,n}}{\partial w_{kl}} \right] \]
65
Error Back-propagation
Error backprop unravels the multivariate chain rule and solves the gradient for each partial component separately. The target values for each layer come from the next layer. This feeds the errors back along the network.
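A bare-bones numpy sketch of backpropagation for a single hidden layer with sigmoid units and squared error; the shapes, learning rate, and single training example are placeholders chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x, y = np.array([0.5, -1.0]), np.array([1.0])
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
lr = 0.1

for step in range(100):
    # forward pass
    h = sigmoid(W1 @ x)            # hidden activations
    out = sigmoid(W2 @ h)          # network output
    # backward pass: propagate the error from the output layer to the hidden layer
    delta_out = (out - y) * out * (1 - out)        # dL/d(pre-activation) at the output
    delta_h = (W2.T @ delta_out) * h * (1 - h)     # dL/d(pre-activation) at the hidden layer
    # gradient descent update on each layer's weights
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_h, x)

print(out.item())   # should move toward the target 1.0
```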
66
Problems with Neural Networks
Interpretation of Hidden Layers Overfitting
67
Interpretation of Hidden Layers
What are the hidden layers doing?! Feature Extraction The non-linearities in the feature extraction can make interpretation of the hidden layers very difficult. This leads to Neural Networks being treated as black boxes.
68
Overfitting in Neural Networks
Neural networks are especially prone to overfitting.
Recall the perceptron error: zero training error is possible, but so is more extreme overfitting.
(Figure: decision boundaries learned by logistic regression vs. a perceptron on the same data.)
69
Handwriting Recognition
Demo:
70
Convolutional Network
The network is not fully connected. Different nodes are responsible for different regions of the image. This allows for robustness to transformations.
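To make the "not fully connected" idea concrete, here is a small numpy sketch of a 2-D convolution, where each output unit only sees a local patch of the image and the same weights are reused everywhere; the kernel and input are made up for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a small kernel over the image: each output unit sees only a local region."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1., 0., -1.]] * 3)    # a simple vertical-edge detector
print(conv2d_valid(image, edge_kernel))        # 3x3 feature map
```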
71
Other Neural Networks Multiple Outputs Skip Layer Network
Recurrent Neural Networks
72
Multiple Outputs Used for N-way classification.
Each Node in the output layer corresponds to a different class. No guarantee that the sum of the output vector will equal 1.
73
Skip Layer Network Input nodes are also sent directly to the output layer.
74
Recurrent Neural Networks
Output or hidden layer information is stored in a context or memory layer.
(Figure: input layer -> hidden layer -> output layer, with a context layer feeding back into the hidden layer.)
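In equations (a standard Elman-style recurrence, stated here as an assumption rather than a formula from the slide), the context layer simply holds the previous hidden state:

\[ h_t = f(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h), \qquad y_t = g(W_{hy}\, h_t + b_y). \]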
76
Time Delayed Recurrent Neural Networks (TDRNN)
The output layer from time t is used as input to the hidden layer at time t+1, with an optional decay.
(Figure: input layer and delayed output layer both feeding the hidden layer.)
77
State-of-the-art and Remaining Challenges
State-of-the-art performance:
- On ACE data sets: about 89% F-measure (Florian et al., 2006; Ji and Grishman, 2006; Nguyen et al., 2010; Zitouni and Florian, 2008)
- On CoNLL data sets: about 91% F-measure (Lin and Wu, 2009; Ratinov and Roth, 2009)
Remaining challenges:
- Identification, especially of organizations
- Boundary errors: "Asian Pulp and Paper Joint Stock Company, Lt. of Singapore"
- Need coreference resolution or context event features: "FAW has also utilized the capital market to directly finance, and now owns three domestic listed companies" (FAW = First Automotive Works)
- Classification: "Caribbean Union" - ORG or GPE?
78
Obtaining Gazetteers Automatically?
Data is power.
The Web is one of the largest text corpora; however, web search is slooooow (if you have a million queries).
N-gram data: a compressed version of the web.
- Already proven to be useful for language modeling
- Tools for large N-gram data sets are not widely available
What are the uses of N-grams beyond language models?
79
Counts on the Web
80
died in (a|an) _ accident
car 13966, automobile 2954, road 1892, auto 1650, traffic 1549, tragic 1480, motorcycle 1399, boating 823, freak 733, drowning 438, vehicle 417, hunting 304, helicopter 289, skiing 281, mining 254, train 250, airplane 236, plane 234, climbing 231, bus 208, motor 198, industrial 187, swimming 180, training 170, motorbike 155, aircraft 152, terrible 137, riding 136, bicycle 132, diving 127, tractor 115, construction 111, farming 107, horrible 105, one-car 104, flying 103, hit-and-run 99, similar 89, racing 89, hiking 89, truck 86, farm 81, bike 78, mine 75, carriage 73, logging 72, unfortunate 71, railroad 71, work-related 70, snowmobile 70, mysterious 68, fishing 67, shooting 66, mountaineering 66, highway 66, single-car 63, cycling 62, air 59, boat 59, horrific 56, sailing 55, fatal 55, workplace 50, skydiving 50, rollover 50, one-vehicle 48, <UNK> 48, work 47, single-vehicle 47, vehicular 45, kayaking 43, surfing 42, automobile 41, car 40, electrical 39, ATV 39, railway 38, Humvee 38, skating 35, hang-gliding 35, canoeing 35, , shuttle 34, parachuting 34, jeep 34, ski 33, bulldozer 31, aviation 30, van 30, bizarre 30, wagon 27, two-vehicle 27, street 27, glider 26, " 25, sawmill 25, horse 25, bomb-making 25, bicycling 25, auto 25, alcohol-related 24, snowboarding 24, motoring 24, early-morning 24, trucking 23, elevator 22, horse-riding 22, fire 22, two-car 21, strange 20, mountain-climbing 20, drunk-driving 20, gun 19, rail 18, snowmobiling 17, mill 17, forklift 17, biking 17, river 16, motorcyle 16, lab 16, gliding 16, bonfire 16, apparent 15, aeroplane 15, testing 15, sledding 15, scuba-diving 15, rock-climbing 15, rafting 15, fiery 15, scooter 14, parachute 14, four-wheeler 14, suspicious 13, rodeo 13, mountain 13, laboratory 13, flight 13, domestic 13, buggy 13, horrific 12, violent 12, trolley 12, three-vehicle 12, tank 12, sudden 12, stupid 12, speedboat 12, single 12, jousting 12, ferry 12, airplane 12, unrelated 11, transporter 11, tram 11, scuba 11, common 11, canoe 11, skateboarding 10, ship 10, paragliding 10, paddock 10, moped 10, factory 10
81
A Typical Name Tagger
- Name labeled corpora: 1,375 documents, about 16,500 name mentions
- Manually constructed name gazetteer including 245,615 names
- Census data including 5,014 person-gender pairs (including myself)
82
Patterns for Gender and Animacy Discovery
Property | Pattern                | Target [count]                                                          | Context                         | Pronoun                                     | Example
Gender   | Conjunction-Possessive | noun [292,212] / capitalized [162,426]                                  | conjunction                     | his / her / its / their                     | John and his
Gender   | Nominative-Predicate   | noun [53,587]                                                           | am / is / are / was / were / be | he / she / it / they                        | he is John
Gender   | Verb-Nominative        | noun [116,607]                                                          | verb                            |                                             | John thought he
Gender   | Verb-Possessive        | noun [88,577] / capitalized [52,036]                                    |                                 |                                             | John bought his
Gender   | Verb-Reflexive         | noun [18,725]                                                           |                                 | himself / herself / itself / themselves     | John explained himself
Animacy  | Relative-Pronoun       | (noun or adjective) not after (preposition, noun, adjective) [664,673]  | comma / empty                   | who / which / where / when                  | John, who
83
Lexical Property Mapping
Gender:
Pronoun                   | Value
his / he / himself        | masculine
her / she / herself       | feminine
its / it / itself         | neutral
their / they / themselves | plural
Animacy:
Pronoun                   | Value
who                       | animate
which / where / when      | non-animate
84
Gender Discovery Examples
If a mention indicates male and female with high confidence, it is likely to be a person mention.
Patterns for candidate mentions    | male | female | neutral | plural
John Joseph bought/... his/...     | 32   |        |         |
Haifa and its/...                  | 21   | 19     | 92      | 15
screenwriter published/... his/... | 144  | 27     |         |
it/... is/... fish                 | 22   | 41     | 1741    | 1186
85
Animacy Discovery Examples
If a mention indicates animacy with high confidence, it is likely to be a person mention.
Patterns for candidate mentions | who (animate) | when | where | which (non-animate)
supremo                         | 24            |      |       |
shepherd                        | 807           | 56   |       |
prophet                         | 7372          | 1066 | 63    | 1141
imam                            | 910           | 76   | 57    |
oligarchs                       | 299           | 13   | 28    |
sheikh                          | 338           | 11   |       |
86
Overall Procedure
Offline processing: Google N-grams -> gender & animacy knowledge discovery -> Confidence(noun, masculine/feminine/animate)
Online processing: test document -> token scanning & stop-word filtering -> candidate name mentions and candidate nominal mentions -> confidence estimation and fuzzy matching against the offline knowledge -> person mentions
87
Unsupervised Mention Detection Using Gender and Animacy Statistics
Candidate mention detection:
- Name: capitalized sequence of <= 3 words; filter stop words, nationality words, dates, numbers, and title words
- Nominal: un-capitalized sequence of <= 3 words without stop words
Margin confidence estimation:
Confidence(candidate, Male/Female/Animate) = (freq(best property) - freq(second best property)) / freq(second best property), and a candidate is accepted when this confidence exceeds a threshold.
Matching strategies:
- Full matching: candidate = full string
- Composite matching: candidate = each token in the string
- Relaxed matching: candidate = any two tokens in the string
A small sketch of the margin-based confidence check appears below.
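A minimal Python sketch of the margin-based confidence check; the example counts, the threshold value, and the acceptance rule are assumptions (the slide leaves the threshold unspecified).

```python
def margin_confidence(freqs):
    """Margin between the best and second-best property counts, normalized by the latter."""
    ranked = sorted(freqs.values(), reverse=True)
    f1 = ranked[0]
    f2 = ranked[1] if len(ranked) > 1 else 1
    return (f1 - f2) / max(f2, 1)

# Hypothetical counts mined from n-grams for the candidate "screenwriter"
freqs = {"masculine": 144, "feminine": 27, "neutral": 0, "plural": 0}
threshold = 2.0   # placeholder; the actual threshold is not given in this text version
if margin_confidence(freqs) > threshold:
    print("accept as a person mention, dominant property:", max(freqs, key=freqs.get))
```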
88
Property Matching Examples
Mention candidate     | Matching method    | String for matching | masculine | feminine | neutral | plural
John Joseph           | Full matching      | John Joseph         | 32        |          |         |
Ayub Masih            | Composite matching | Ayub                | 87        |          |         |
                      |                    | Masih               | 117       |          |         |
Mahmoud Salim Qawasmi | Relaxed matching   | Mahmoud             | 159       | 13       |         |
                      |                    | Salim               | 188       |          |         |
                      |                    | Qawasmi             |           |          |         |
89
Separate Wheat from Chaff: Confidence Estimation
Rank the properties for each noun according to their frequencies: f1 > f2 > ... > fk.
90
Experiments: Data
91
Impact of Knowledge Sources on Mention Detection for Dev Set
Patterns applied to n-grams for name mentions:
Pattern                | Example         | P(%)  | R(%)  | F(%)
Conjunction-Possessive | John and his    | 68.57 | 64.86 | 66.67
+Verb-Nominative       | John thought he | 69.23 | 72.97 | 71.05
+Animacy               | John, who       | 85.48 | 81.96 | 83.68

Patterns applied to n-grams for nominal mentions:
Pattern                | Example                  | P(%)  | R(%)  | F(%)
Conjunction-Possessive | writer and his           | 78.57 | 10.28 | 18.18
+Predicate             | He is a writer           |       | 20.56 | 32.59
+Verb-Nominative       | writer thought he        | 65.85 | 25.23 | 36.49
+Verb-Possessive       | writer bought his        | 55.71 | 36.45 | 44.07
+Verb-Reflexive        | writer explained himself | 64.41 | 35.51 | 45.78
+Animacy               | writer, who              | 63.33 | 71.03 | 66.96
92
Name Tagging: “Old” Milestones
Year | Tasks & Resources | Methods                                                                  | F-Measure | Example References
1966 | -                 | First person name tagger with punch cards; 30+ decision-tree-type rules |           | (Borkowski et al., 1966)
1998 | MUC-6             | MaxEnt with diverse levels of linguistic features                        | 97.12%    | (Borthwick and Grishman, 1998)
2003 | CoNLL             | System combination; sequential labeling with Conditional Random Fields  | 89%       | (Florian et al., 2003; McCallum et al., 2003; Finkel et al., 2005)
2006 | ACE               | Diverse levels of linguistic features, re-ranking, joint inference      | ~89%      | (Florian et al., 2006; Ji and Grishman, 2006)

Our progress compared to 1966: more data, a few more features, and fancier learning algorithms.
Not much active work after ACE, because we tend to believe it is a solved problem...
93
The end of extreme happiness is sadness…
State-of-the-art reported in papers
94
The end of extreme happiness is sadness…
Experiments on ACE2005 data
95
What's Wrong?
Name taggers are getting old (trained on 2003 news and tested on 2012 news):
- Genre adaptation (informal contexts, posters)
- Revisit the definition of name mention: extraction for linking
- Limited types of entities (we really only cared about PER, ORG, GPE)
Old unsolved problems:
- Identification: "Asian Pulp and Paper Joint Stock Company, Lt. of Singapore"
- Classification: "FAW has also utilized the capital market to directly finance, ..." (FAW = First Automotive Works)
Potential solutions for quality:
- Word clustering, lexical knowledge discovery (Brown, 1992; Ratinov and Roth, 2009; Ji and Lin, 2010)
- Feedback from linking, relation, and event extraction (Sil and Yates, 2013; Li and Ji, 2014)
Potential solutions for portability:
- Extend entity types based on AMR (140+)