Uncertainty Management in Rule-Based Information Extraction Systems. Authors: Eirinaios Michelakis, Rajasekar Krishnamurthy, Peter J. Haas, Shivkumar Vaithyanathan.

Presentation transcript:

Uncertainty Management in Rule-Based Information Extraction Systems. Authors: Eirinaios Michelakis, Rajasekar Krishnamurthy, Peter J. Haas, Shivkumar Vaithyanathan. Presented by Anurag Kulkarni.

- Rule-based information extraction: user-defined rules transform unstructured data (any free text) into structured data (e.g., objects in a database).
- Need:
  - Uncertainty in extraction arises from the varying precision of the rules used in a specific extraction task.
  - Quantification of that uncertainty is needed for the extracted objects to be used in probabilistic databases (PDBs).
  - Modeling uncertainty also helps improve the recall of extraction tasks.
- Types of rule-based IE systems:
  1. Trainable: rules are learned from data.
  2. Knowledge-engineered: rules are hand-crafted by domain experts.

Annotator       Rules            Rule Precision
Person          P1, P2, P3       High, Low
Phone Number    Ph1, Ph2, Ph3    High, Medium, Low
PersonPhone     PP1, PP2, PP3    High, Medium

- Annotator: a coordinated set of rules written for a particular IE task.
  - Base annotators operate only over raw text.
  - Derived annotators operate over previously defined annotations.
- Annotations: the extracted objects.
- Rules:
  - Candidate-generation rules (R): the individual rules that propose candidate annotations.
  - Discard rules: discard some candidates.
  - Merge rules: merge a set of candidates to produce a result annotation.
  - Consolidation rule (K): a special rule used to combine the outputs of the annotator rules.
- Confidence: the probability that the associated annotation is correct.
- Span: an annotator identifies a set of structured objects in a body of text, producing a set of annotations; an annotation a = (s1, ..., sn) is a tuple of spans.
- Example (Person and PhoneNumber annotations) for the input text "... Greg Mann can be reached at ":
  s = "Greg Mann can be reached at "
  s1 = "Greg Mann"
  s2 = " " (the PhoneNumber span)
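The data model above can be made concrete with a short sketch. This is illustrative only; the class and field names are assumptions, not the authors' code:

```python
# A minimal sketch of the span/annotation data model described above:
# an annotation a = (s1, ..., sn) is a tuple of spans, and a confidence
# (probability of correctness) can be attached to it.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Span:
    begin: int   # character offset where the span starts
    end: int     # character offset where the span ends (exclusive)
    text: str    # the covered text

@dataclass(frozen=True)
class Annotation:
    atype: str                          # e.g. "Person", "PhoneNumber", "PersonPhone"
    spans: Tuple[Span, ...]             # a = (s1, ..., sn)
    confidence: Optional[float] = None  # probability that the annotation is correct

doc = "Greg Mann can be reached at ..."
s1 = Span(0, 9, doc[0:9])               # the Person span "Greg Mann"
person = Annotation("Person", (s1,))
```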

Algorithm 1: Template for a Rule-based Annotator
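The body of Algorithm 1 is not reproduced in the transcript, so the following is a hedged sketch of what such a template plausibly looks like, assembled from the rule types defined on the previous slide (candidate-generation rules R, discard and merge rules, and a consolidation rule K). All function names are illustrative assumptions, not the authors' API:

```python
def annotate(doc, generation_rules, discard_rules, merge_rules, consolidate):
    # 1. Candidate generation: every rule R_i proposes candidate annotations.
    candidates = []
    for rule in generation_rules:
        candidates.extend(rule(doc))

    # 2. Discard rules remove clearly bad candidates.
    for discard in discard_rules:
        candidates = [c for c in candidates if not discard(c)]

    # 3. Merge rules combine sets of candidates into result annotations.
    for merge in merge_rules:
        candidates = merge(candidates)

    # 4. The consolidation rule K combines/filters the rule outputs into the
    #    final set of annotations (recording which rules fired for each one).
    return consolidate(candidates)
```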

- Simply associating an arbitrary confidence rating (e.g., "high", "medium", or "low") with each annotation is insufficient; a proper confidence must be associated with each annotation.
- Uses of confidence numbers:
  - Enable principled assessments of risk or quality in applications that use extracted data.
  - Help improve the quality of the annotators themselves.
  - Associate a probability with each annotation to capture the annotator's confidence that the annotation is correct.
- Modified rule-based annotator: the tuple (R, K, L, C), where the training data L = (L_D, L_L) consists of a set of training documents L_D and a set of labels L_L (a small sketch of this representation follows below). For example, a label might be represented as a tuple of the form (docID, s, Person), where s is the span corresponding to the Person annotation. C describes key statistical properties of the rules that comprise the annotator.
- The Consolidate operator is modified to include the rule history.
- The annotation procedure is modified to include the statistical model M.
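A minimal, purely illustrative sketch of the training-data representation L = (L_D, L_L) described above; the type and field names are assumptions:

```python
# L_D is a set of training documents, L_L a set of labels of the form
# (docID, span, annotation type), e.g. (docID, s, "Person").
from typing import Dict, List, Tuple

SpanOffsets = Tuple[int, int]          # (begin, end) character offsets of a span

class LabeledData:
    def __init__(self) -> None:
        self.documents: Dict[str, str] = {}                    # L_D: docID -> document text
        self.labels: List[Tuple[str, SpanOffsets, str]] = []   # L_L: (docID, span, type)

L = LabeledData()
L.documents["doc-1"] = "Greg Mann can be reached at ..."
L.labels.append(("doc-1", (0, 9), "Person"))   # the span "Greg Mann"
```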

- q(r) = P(A(s) = 1 | R(s) = r, K(s) = 1); q(r) is the confidence associated with the annotation.
- R(s) = (R1(s), R2(s), ..., Rk(s)), where Ri(s) = 1 if and only if rule Ri holds for span s or at least one sub-span of s.
- A(s) = 1 if and only if span s corresponds to a true annotation.
- H is the set of possible rule histories, H = {0, 1}^k, and r ∈ H.
- Using Bayes' rule, define
  p1(r) = P(R(s) = r | A(s) = 1, K(s) = 1) and p0(r) = P(R(s) = r | A(s) = 0, K(s) = 1),
  and set π = P(A(s) = 1 | K(s) = 1). Applying Bayes' rule again yields
  q(r) = π p1(r) / ( π p1(r) + (1 − π) p0(r) ).
- Here we have converted the problem of estimating a collection of posterior probabilities into the problem of estimating the distributions p0 and p1. Unfortunately, whereas this method typically works well for estimating π, the estimates for p0 and p1 can be quite poor. The problem is data sparsity: there are 2^k different possible values of r and only a limited supply of labeled training data.
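The Bayes inversion above is simple to compute once π, p1, and p0 are available. The sketch below uses assumed names and toy numbers; it is not the authors' code:

```python
# q(r) = pi*p1(r) / (pi*p1(r) + (1-pi)*p0(r)), where r is a rule history in {0,1}^k.
def annotation_confidence(r, pi, p1, p0):
    num = pi * p1[r]
    den = pi * p1[r] + (1.0 - pi) * p0[r]
    return num / den if den > 0 else 0.0

# Toy example with k = 2 rules:
p1 = {(1, 1): 0.6, (1, 0): 0.3, (0, 1): 0.1, (0, 0): 0.0}
p0 = {(1, 1): 0.1, (1, 0): 0.3, (0, 1): 0.4, (0, 0): 0.2}
print(annotation_confidence((1, 1), pi=0.5, p1=p1, p0=p0))  # ≈ 0.857
```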


- Select a set C of important constraints that are satisfied by p1, and then approximate p1 by the "simplest" distribution that obeys the constraints in C.
- Following standard practice, we formalize the notion of "simplest distribution" as the distribution p satisfying the given constraints that has the maximum entropy value H(p), where H(p) = − Σ_{r ∈ H} p(r) log p(r).
- Denoting by P the set of all probability distributions over H, we approximate p1 by the solution p of the maximization problem:
  maximize H(p) over p ∈ P, subject to Σ_{r ∈ H} f_c(r) p(r) = a_c for each c ∈ C.
- Here f_c is the indicator function of the subset of rule histories associated with constraint c, so that f_c(r) = 1 exactly when r lies in that subset, and a_c is computed directly from the training data L as N_c / N_1, where N_1 is the number of spans s such that A(s) = 1 and K(s) = 1, and N_c is the number of these spans such that f_c(R(s)) = 1.
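A hedged sketch of how the empirical constraint targets a_c = N_c / N_1 could be computed from labeled data; the container names and indicator functions are illustrative assumptions:

```python
# `histories` holds the rule history R(s) for every span with A(s) = 1 and
# K(s) = 1; `constraints` maps a constraint id to its indicator function f_c.
def empirical_targets(histories, constraints):
    n1 = len(histories)
    targets = {}
    for c, f_c in constraints.items():
        n_c = sum(1 for r in histories if f_c(r))
        targets[c] = n_c / n1          # a_c = N_c / N_1
    return targets

# Example: single-rule constraints C1..C3 plus a pairwise constraint C12.
constraints = {
    "C1": lambda r: r[0] == 1,
    "C2": lambda r: r[1] == 1,
    "C3": lambda r: r[2] == 1,
    "C12": lambda r: r[0] == 1 and r[1] == 1,
}
histories = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)]
print(empirical_targets(histories, constraints))
```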

- Reformulate the maximum-entropy problem as a more convenient maximum-likelihood (ML) problem.
- Let θ = { θ_c : c ∈ C } be the set of Lagrange multipliers for the original problem. To solve the inner maximization problem, take the partial derivative with respect to p(r) and set this derivative equal to 0, to obtain
  p_θ(r) = exp( Σ_{c ∈ C} θ_c f_c(r) ) / Z(θ),
  where Z(θ) is the normalizing constant that ensures Σ_{r ∈ H} p_θ(r) = 1.
- Substituting this value of p(r) back into the Lagrangian yields the dual problem in θ alone, with each a_c estimated from the training data.
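The exponential-family form derived above can be evaluated directly by summing over the 2^k rule histories. A minimal sketch with assumed names and brute-force normalization (practical only for small k):

```python
# p_theta(r) = exp(sum_c theta_c * f_c(r)) / Z(theta), with Z(theta) summing
# over all rule histories in {0,1}^k.
import itertools
import math

def p_theta(theta, constraints, k):
    """Return p_theta as a dict over all rule histories in {0,1}^k."""
    H = list(itertools.product([0, 1], repeat=k))
    weights = {
        r: math.exp(sum(theta[c] * (1 if f_c(r) else 0)
                        for c, f_c in constraints.items()))
        for r in H
    }
    Z = sum(weights.values())           # normalizing constant Z(theta)
    return {r: w / Z for r, w in weights.items()}

# Example with k = 2 rules and two single-rule constraints:
constraints = {"C1": lambda r: r[0] == 1, "C2": lambda r: r[1] == 1}
theta = {"C1": 0.0, "C2": 0.0}
print(p_theta(theta, constraints, k=2))  # uniform over {0,1}^2 when theta is all zero
```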

- Multiply the objective function by the constant N_1 and change the order of summation to find that solving the dual problem is equivalent to solving an ML optimization problem.
- The triples {A(s), K(s), R(s) : s ∈ S} are mutually independent for any set S of distinct spans; denote by S_1 the set of spans such that A(s) = K(s) = 1. It can then be seen that the objective function is precisely the log-likelihood, under the distribution p_θ(r) from the previous slide, of observing, for each r ∈ H, exactly N_r rule histories in S_1 equal to r.
- The optimization problem rarely has a tractable closed-form solution, so approximate iterative solutions are used in practice; we use the Improved Iterative Scaling (IIS) algorithm.

- IIS increases the value of the normalized log-likelihood l(θ; L); here normalization refers to division by N_1. It starts with an initial set of parameters θ(0) = {0, ..., 0} and, at the (t + 1)-st iteration, attempts to find a new set of parameters θ(t+1) := θ(t) + δ(t) such that l(θ(t+1); L) > l(θ(t); L).
- Denote by Γ(δ(t)) = Γ(δ(t); θ(t), L) the increase in the normalized log-likelihood between the t-th and (t + 1)-st iterations.

- IIS achieves efficient performance by solving a relaxed version of the above optimization problem at each step. Specifically, IIS chooses δ(t) to maximize a tractable approximation of Γ(δ(t)) (with a = 1). A sketch of the resulting per-constraint update appears below.
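Below is a hedged sketch of one IIS step for the exponential model above. It implements the standard IIS update (solving, for each constraint c, a one-dimensional equation involving the total feature count f#(r)), which may differ in detail from the exact variant used in the paper; it reuses the p_theta() helper from the earlier sketch, and the bisection bracket and iteration count are arbitrary choices:

```python
import itertools
import math

def iis_step(theta, constraints, k, targets):
    """One Improved Iterative Scaling update; targets[c] is the empirical a_c."""
    dist = p_theta(theta, constraints, k)            # current model distribution over {0,1}^k
    H = list(itertools.product([0, 1], repeat=k))
    f_sharp = {r: sum(1 for f in constraints.values() if f(r)) for r in H}

    new_theta = dict(theta)
    for c, f_c in constraints.items():
        # Solve targets[c] = sum_r p_theta(r) * f_c(r) * exp(delta * f#(r))
        # for delta by bisection (the expectation is increasing in delta).
        def expected(delta):
            return sum(dist[r] * math.exp(delta * f_sharp[r]) for r in H if f_c(r))
        lo, hi = -10.0, 10.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if expected(mid) < targets[c]:
                lo = mid
            else:
                hi = mid
        new_theta[c] = theta[c] + (lo + hi) / 2.0
    return new_theta
```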

- Exact decomposition. Example: consider an annotator with R = {R1, R2, R3, R4} and constraint set C = {C1, C2, C3, C4, C12, C23}. Then the partitioning is {{R1, R2, R3}, {R4}}, and the algorithm fits two independent exponential distributions. The first distribution has parameters θ1, θ2, θ3, θ12, and θ23, whereas the second distribution has the single parameter θ4. For this example, the maximum partition size is d = 3.
- Approximate decomposition. The foregoing decomposition technique allows us to efficiently compute the exact ML solution for a large number of rules, provided that the constraints in C \ C0 correlate only a small number of rules, so that the maximum partition size d stays small.
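The partitioning step can be illustrated with a small sketch: rules linked by a shared multi-rule constraint must land in the same block, so the blocks are the connected components of a graph whose edges come from those constraints. The representation below (rules as strings, constraints as sets of rule names) is an assumption for illustration:

```python
# Union-find over rules; each multi-rule constraint merges the rules it mentions.
def partition_rules(rules, constraint_rule_sets):
    parent = {r: r for r in rules}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for rule_set in constraint_rule_sets:   # rules mentioned by each constraint
        rule_list = list(rule_set)
        for other in rule_list[1:]:
            union(rule_list[0], other)

    blocks = {}
    for r in rules:
        blocks.setdefault(find(r), set()).add(r)
    return list(blocks.values())

# The slide's example: C12 ties R1-R2 and C23 ties R2-R3, so the partition
# is {{R1, R2, R3}, {R4}}.
rules = ["R1", "R2", "R3", "R4"]
constraint_rule_sets = [{"R1"}, {"R2"}, {"R3"}, {"R4"}, {"R1", "R2"}, {"R2", "R3"}]
print(partition_rules(rules, constraint_rule_sets))
```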

- For a derived annotator (e.g., PersonPhone) over spans (s, s1, s2), let Q_i (i = 1, 2) denote the annotation probability that the system associates with span s_i.
- For r ∈ H and q1, q2 ∈ [0, 1], rewrite the annotation probability using Bayes' rule:
  q(r, q1, q2) = π p1^(d)(r, q1, q2) / ( π p1^(d)(r, q1, q2) + (1 − π) p0^(d)(r, q1, q2) ),
  where π = P(A(s, s1, s2) = 1 | K^(d)(s, s1, s2) = 1) and
  p_j^(d)(r, q1, q2) = P(R^(d)(s, s1, s2) = r, Q1 = q1, Q2 = q2 | A(s, s1, s2) = j, K^(d)(s, s1, s2) = 1) for j = 0, 1.

- Data: emails from the Enron collection in which all of the true person names have been labeled.
- The dataset consisted of 1564 Person instances, 312 PhoneNumber instances, and 219 PersonPhone relationship instances.
- IE system used: SystemT, developed at IBM.
- Evaluation methods: Rule Divergence and Bin Divergence.

1) Pay as You Go: Data. We observed the accuracy of the annotation probabilities as the amount of labeled data increased.
2) Pay as You Go: Constraints. We observed the accuracy of the annotation probabilities as additional constraints were provided.

3) Pay as You Go: Rules. We observed the precision and recall of an annotator as new or improved rules were added.

- The need for modeling uncertainty
- Probabilistic IE model
- Derivation of the parametric IE model
- Performance improvements
- Extending the probabilistic IE model to derived annotators
- Evaluation using Rule Divergence and Bin Divergence
- Judging the accuracy of annotations using the pay-as-you-go paradigm
