Uncertainty Management in Rule-Based Information Extraction Systems. Authors: Eirinaios Michelakis, Rajasekar Krishnamurthy, Peter J. Haas, Shivkumar Vaithyanathan.

Presentation transcript:

1 Uncertainty Management in Rule-Based Information Extraction Systems. Authors: Eirinaios Michelakis, Rajasekar Krishnamurthy, Peter J. Haas, Shivkumar Vaithyanathan. Presented by Anurag Kulkarni.

2 Rule-based information extraction: user-defined rules transform unstructured data (any free text) into structured data (e.g., objects in a database).

Need:
- Uncertainty in extraction arises from the varying precision of the rules used in a specific extraction task.
- Probabilistic databases (PDBs) need this uncertainty quantified for the extracted objects.
- Quantified uncertainty can be used to improve the recall of extraction tasks.

Types of rule-based IE systems:
1. Trainable: rules are learned from data.
2. Knowledge-engineered: rules are hand-crafted by domain experts.

3 Example annotators and rules:

Annotator     | Rules           | Rule precision
Person        | P1, P2, P3      | High, Low
Phone Number  | Ph1, Ph2, Ph3   | High, Medium, Low
PersonPhone   | PP1, PP2, PP3   | High, Medium

4 Annotator: a coordinated set of rules written for a particular IE task.
- Base annotators operate only over raw text.
- Derived annotators operate over previously defined annotations.

Annotations: the extracted objects.

Rules:
- Candidate-generation rules (R): the individual extraction rules.
- Consolidation rule (K): a special rule used to combine the outputs of the annotator rules.
  - Discard rules discard some candidates.
  - Merge rules merge a set of candidates to produce a result annotation.

Span: an annotator identifies a set of structured objects in a body of text, producing a set of annotations; an annotation a = (s1, ..., sn) is a tuple of spans.

Confidence: the probability that the associated annotation is correct.

Example (Person and PhoneNumber annotations): for the input text "... Greg Mann can be reached at 403-663-2817 ...",
s  = "Greg Mann can be reached at 403-663-2817"
s1 = "Greg Mann"
s2 = "403-663-2817"
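The span and annotation notions above can be made concrete with a small sketch (the `Span` type and field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """A contiguous region of the source text."""
    begin: int
    end: int
    text: str

# An annotation is a tuple of spans, e.g. (person span, phone span).
doc = "... Greg Mann can be reached at 403-663-2817 ..."
s1 = Span(doc.index("Greg"), doc.index("Greg") + len("Greg Mann"), "Greg Mann")
s2 = Span(doc.index("403"), doc.index("403") + len("403-663-2817"), "403-663-2817")
person_phone = (s1, s2)  # annotation a = (s1, s2)
```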

5 Algorithm 1: Template for a Rule-based Annotator
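The shape of that template (candidate generation followed by consolidation) can be sketched as follows. This is a minimal illustration: the regular-expression rules Ph1/Ph2 and the containment-based consolidation policy are invented for the example, not taken from the paper's Algorithm 1.

```python
import re

# Candidate-generation rules R: each maps text to a set of candidate spans.
RULES = {
    "Ph1": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),  # e.g. 403-663-2817
    "Ph2": re.compile(r"\bx\d{4}\b"),             # e.g. extension x1234
}

def generate_candidates(text):
    """Apply every rule; record which rules fired for each span (its rule history)."""
    candidates = {}  # (begin, end) -> set of rule names that produced it
    for name, pattern in RULES.items():
        for m in pattern.finditer(text):
            candidates.setdefault(m.span(), set()).add(name)
    return candidates

def consolidate(candidates):
    """Consolidation rule K (example policy): discard any span strictly
    contained in a longer candidate span."""
    spans = sorted(candidates)
    keep = [s for s in spans
            if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)]
    return {s: candidates[s] for s in keep}

text = "Greg Mann can be reached at 403-663-2817"
annotations = consolidate(generate_candidates(text))
```

Note that the rule history attached to each surviving span is exactly what the probabilistic model in the later slides conditions on.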

6 Simply associating an arbitrary confidence rating such as "high", "medium", or "low" with each annotation is insufficient. A confidence number associated with each annotation:
- enables principled assessments of risk or quality in applications that use extracted data;
- helps improve the quality of the annotators themselves.
We therefore associate a probability with each annotation to capture the annotator's confidence that the annotation is correct.

The rule-based annotator is extended to a tuple (R, K, L, C), where:
- Training data L = (L_D, L_L): L_D is a set of training documents and L_L a set of labels. For example, a label might be represented as a tuple (docID, s, Person), where s is the span corresponding to the Person annotation.
- C describes key statistical properties of the rules that comprise the annotator.
The Consolidate operator is modified to track each annotation's rule history, and the annotation procedure is modified to include a statistical model M.

7 Confidence via Bayes' rule:
- R(s) = (R1(s), R2(s), ..., Rk(s)), where Ri(s) = 1 if and only if rule Ri holds for span s or at least one sub-span of s.
- A(s) = 1 if and only if span s corresponds to a true annotation.
- H = {0, 1}^k is the set of possible rule histories, with r ∈ H.
- q(r) = P(A(s) = 1 | R(s) = r, K(s) = 1) is the confidence associated with the annotation.

Define p1(r) = P(R(s) = r | A(s) = 1, K(s) = 1), p0(r) = P(R(s) = r | A(s) = 0, K(s) = 1), and π = P(A(s) = 1 | K(s) = 1). Applying Bayes' rule then yields

q(r) = π p1(r) / ( π p1(r) + (1 − π) p0(r) ).

This converts the problem of estimating a collection of posterior probabilities into the problem of estimating the distributions p0 and p1 (along with π). Unfortunately, whereas this method typically works well for estimating π, the estimates for p0 and p1 can be quite poor. The problem is data sparsity: there are 2^k different possible r values but only a limited supply of labeled training data.
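Once π, p1(r), and p0(r) have been estimated, the confidence is a one-line Bayes computation. A minimal sketch (the numeric inputs are made up for illustration):

```python
def confidence(pi, p1_r, p0_r):
    """q(r) = pi*p1(r) / (pi*p1(r) + (1 - pi)*p0(r)):
    posterior probability that the annotation is correct,
    given rule history r and consolidation K(s) = 1."""
    num = pi * p1_r
    return num / (num + (1.0 - pi) * p0_r)

# A history much more likely under true annotations than false ones scores high:
q = confidence(pi=0.6, p1_r=0.30, p0_r=0.05)
```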


9 Maximum-entropy approximation: select a set C of important constraints that are satisfied by p1, and approximate p1 by the "simplest" distribution that obeys the constraints in C. Following standard practice, we formalize the "simplest" distribution as the one satisfying the given constraints that has maximum entropy

H(p) = − Σ_{r ∈ H} p(r) log p(r).

Denoting by P the set of all probability distributions over H, we approximate p1 by the solution p of the problem

maximize H(p) over p ∈ P, subject to Σ_{r ∈ H} f_c(r) p(r) = a_c for each c ∈ C,

where f_c is the indicator function of the subset of H associated with constraint c (so f_c(r) ∈ {0, 1}), and a_c is computed directly from the training data L as N_c / N_1. Here N_1 is the number of spans s such that A(s) = 1 and K(s) = 1, and N_c is the number of these spans such that f_c(R(s)) = 1.
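The constraint targets a_c = N_c / N_1 can be read directly off the labeled spans. A sketch, using simple single-rule marginal indicators as the (invented) feature set:

```python
def constraint_targets(histories, features):
    """histories: rule-history tuples r for the N1 training spans with
       A(s) = 1 and K(s) = 1 (the true, consolidated annotations).
       features: dict c -> indicator function f_c over histories.
       Returns a_c = N_c / N_1 for each constraint c."""
    n1 = len(histories)
    return {c: sum(f(r) for r in histories) / n1 for c, f in features.items()}

# Single-rule marginal constraints f_i(r) = r[i-1] for a 2-rule annotator:
true_histories = [(1, 0), (1, 1), (1, 0), (0, 1)]  # N1 = 4 labeled positives
features = {1: lambda r: r[0], 2: lambda r: r[1]}
a = constraint_targets(true_histories, features)
```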

10 Reformulate the maximum-entropy problem as a more convenient maximum-likelihood (ML) problem. Let θ = { θ_c : c ∈ C } be the set of Lagrange multipliers for the original problem. To solve the inner maximization, take the partial derivative with respect to p(r) and set it equal to 0, obtaining

p(r) = exp( Σ_{c ∈ C} θ_c f_c(r) ) / Z(θ),

where Z(θ) = Σ_{r ∈ H} exp( Σ_{c ∈ C} θ_c f_c(r) ) is the normalizing constant that ensures Σ_r p(r) = 1. Substituting this expression for p(r) back into the Lagrangian yields the dual problem, with each a_c estimated from the training data.
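For small k, the resulting exponential-family distribution can be evaluated exactly by enumerating H = {0,1}^k. A minimal sketch (feature set again invented for illustration):

```python
from itertools import product
from math import exp

def exp_model(theta, features, k):
    """p(r) = exp(sum_c theta_c * f_c(r)) / Z(theta), by brute-force
    enumeration of all 2^k rule histories in H."""
    H = list(product((0, 1), repeat=k))
    weights = {r: exp(sum(theta[c] * f(r) for c, f in features.items()))
               for r in H}
    Z = sum(weights.values())
    return {r: w / Z for r, w in weights.items()}

features = {1: lambda r: r[0], 2: lambda r: r[1]}
p = exp_model({1: 0.0, 2: 0.0}, features, k=2)  # all-zero theta gives uniform p
```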

11 Multiplying the objective function by the constant N1 and changing the order of summation shows that solving the dual problem is equivalent to solving the optimization problem

maximize over θ: Σ_{r ∈ H} N_r log p_θ(r).

The triples {A(s), K(s), R(s) : s ∈ S} are mutually independent for any set S of distinct spans; denote by S1 the set of spans with A(s) = K(s) = 1. The objective above is then precisely the log-likelihood, under the exponential distribution p_θ(r) from the previous slide, of observing, for each r ∈ H, exactly N_r rule histories in S1 equal to r.

This optimization problem rarely has a tractable closed-form solution, so approximate iterative solutions are used in practice; we use the Improved Iterative Scaling (IIS) algorithm.

12 The iterative algorithm increases the value of the normalized log-likelihood ℓ(θ; L), where normalization refers to division by N1. It starts with an initial parameter set θ(0) = (0, ..., 0) and, at the (t+1)st iteration, attempts to find a new set of parameters θ(t+1) := θ(t) + δ(t) such that ℓ(θ(t+1); L) > ℓ(θ(t); L).

13 Denote by Γ(δ(t)) = Γ(δ(t); θ(t), L) the increase in the normalized log-likelihood between the t-th and (t+1)st iterations: Γ(δ(t)) = ℓ(θ(t) + δ(t); L) − ℓ(θ(t); L).

14 IIS achieves efficient performance by solving a relaxed version of the above optimization problem at each step. Specifically, IIS chooses δ(t) to maximize a tractable lower bound on Γ(δ(t)), taking a = 1.
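For small k, the IIS loop can be rendered by brute-force enumeration of H. This is a sketch under stated assumptions: the two marginal features, their targets, and the Newton inner solve for the per-constraint update equation are illustrative choices, not the paper's exact formulation.

```python
from itertools import product
from math import exp

def iis_fit(features, targets, k, iters=200):
    """Improved Iterative Scaling for p(r) = exp(sum_c theta_c f_c(r)) / Z.
    Each step solves, per constraint c (standard IIS update equation):
        sum_r p_theta(r) * f_c(r) * exp(delta_c * fhash(r)) = a_c,
    where fhash(r) = sum_c f_c(r); solved here by a few Newton steps."""
    H = list(product((0, 1), repeat=k))
    fhash = {r: sum(f(r) for f in features.values()) for r in H}
    theta = {c: 0.0 for c in features}
    for _ in range(iters):
        # Current model distribution p_theta over rule histories.
        w = {r: exp(sum(theta[c] * f(r) for c, f in features.items())) for r in H}
        Z = sum(w.values())
        p = {r: w[r] / Z for r in H}
        for c, f in features.items():
            delta = 0.0
            for _ in range(20):  # Newton's method on g(delta) = lhs - a_c
                g = sum(p[r] * f(r) * exp(delta * fhash[r]) for r in H) - targets[c]
                dg = sum(p[r] * f(r) * fhash[r] * exp(delta * fhash[r]) for r in H)
                if dg == 0.0:
                    break
                delta -= g / dg
            theta[c] += delta
    return theta

features = {1: lambda r: r[0], 2: lambda r: r[1]}
theta = iis_fit(features, {1: 0.75, 2: 0.5}, k=2)
```

With only single-rule marginal constraints, the fitted model should reproduce the target marginals E_p[f_c] = a_c.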

15 Exact decomposition example: consider an annotator with R = {R1, R2, R3, R4} and constraint set C = {C1, C2, C3, C4, C12, C23}. The partitioning is then {{R1, R2, R3}, {R4}}, and the algorithm fits two independent exponential distributions: the first has parameters θ1, θ2, θ3, θ12, and θ23, whereas the second has the single parameter θ4. For this example, the maximum partition size is d = 3.

Approximate decomposition: the foregoing decomposition technique lets us efficiently compute the exact ML solution for a large number of rules, provided that the constraints in C \ C0 each correlate only a small number of rules, so that the maximum partition size d stays small.
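The partition itself is just the connected components of rules linked by a multi-rule constraint. A union-find sketch (the tuple encoding of constraints, e.g. (1, 2) for C12, is an assumption made for this example):

```python
def partition_rules(n_rules, constraints):
    """Union rules that co-occur in some constraint; return the blocks.
    constraints: iterable of tuples of 1-based rule indices."""
    parent = list(range(n_rules + 1))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for group in constraints:
        for other in group[1:]:
            parent[find(other)] = find(group[0])

    blocks = {}
    for r in range(1, n_rules + 1):
        blocks.setdefault(find(r), set()).add(r)
    return sorted(blocks.values(), key=min)

# The slide's example: C = {C1, C2, C3, C4, C12, C23}
parts = partition_rules(4, [(1,), (2,), (3,), (4,), (1, 2), (2, 3)])
```

For this input the result is {{R1, R2, R3}, {R4}}, and the maximum block size is the d = 3 of the slide.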

16 Derived annotators: let Q_i (i = 1, 2) denote the annotation probability that the system associates with span s_i. For r ∈ H and q1, q2 ∈ [0, 1], rewrite the annotation probabilities using Bayes' rule:

q(r, q1, q2) = π p1^(d)(r, q1, q2) / ( π p1^(d)(r, q1, q2) + (1 − π) p0^(d)(r, q1, q2) ),

where π = P(A(s, s1, s2) = 1 | K^(d)(s, s1, s2) = 1) and

p_j^(d)(r, q1, q2) = P( R^(d)(s, s1, s2) = r, Q1 = q1, Q2 = q2 | A(s, s1, s2) = j, K^(d)(s, s1, s2) = 1 ) for j = 0, 1.

17 Experimental setup:
- Data: emails from the Enron collection in which all of the true person names have been labeled. The dataset consisted of 1564 person instances, 312 phone-number instances, and 219 PersonPhone relationship instances.
- IE system used: System T, developed at IBM.
- Evaluation methods: Rule Divergence and Bin Divergence.

18 1) Pay as you go: Data. We observed the accuracy of the annotation probabilities as the amount of labeled data increased.
2) Pay as you go: Constraints. We observed the accuracy of the annotation probabilities as additional constraints were provided.

19 3) Pay as you go: Rules. We observed the precision and recall of an annotator as new or improved rules were added.

20 Summary:
- The need for modeling uncertainty
- Probabilistic IE model
- Derivation of the parametric IE model
- Performance improvements
- Extending the probabilistic IE model to derived annotators
- Evaluation using Rule Divergence and Bin Divergence
- Judging the accuracy of annotations using the pay-as-you-go paradigm


