Presented by: Aneeta Kolhe. Named Entity Recognition finds approximate matches in text. Important task for information extraction and integration, text.

Presented by: Aneeta Kolhe

Named Entity Recognition finds approximate matches in text. It is an important task for information extraction and integration, text mining, and web search.

• Approximate dictionary matching
• Previous solution: token-based similarity constraints
• Proposed solution: neighborhood generation method

The token-based approach:
• It uses Jaccard coefficient similarity.
• It may miss some matches.
• It may return too many matches.

For example, given the entity “al-qaida”:
• “al-qaeda” or “al-qa’ida” won’t be matched unless a low Jaccard similarity threshold of 0.33 is used.
• At that threshold, “al-qaida” will also match “al gore” as well as “al pacino”.
Hence we use edit distance.
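The numbers in this example can be checked with a small sketch. It assumes tokens are formed by splitting on hyphens and whitespace, which the slides do not state; `tokens`, `jaccard`, and `edit_distance` are illustrative helper names, not part of the presented method.

```python
import re

def tokens(text):
    # Assumption: tokens are separated by hyphens or whitespace.
    return set(re.split(r"[-\s]+", text.lower()))

def jaccard(a, b):
    # Jaccard coefficient of the two token sets.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]

for other in ["al-qaeda", "al gore", "al pacino"]:
    print(other, round(jaccard("al-qaida", other), 2),
          edit_distance("al-qaida", other))
```

Under this tokenization all three candidates share exactly one of three distinct tokens with "al-qaida", so the Jaccard score cannot separate them, while the edit distance is 1 for "al-qaeda" and 6 for "al gore".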

Problem definition:
• Given: a document D and a dictionary E of entities.
• To find: all substrings in D that are within edit distance τ of one of the entities in E.
Solution:
• Iterate through all the valid substrings of the document D, considering each substring as a query segment.
• For each query segment, issue a similarity selection query to the dictionary to retrieve the set of entities that satisfy the constraint.
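The straightforward solution described here can be sketched as a brute-force scan. This is only an illustration of the problem statement, not the indexed method presented later; `naive_extract` is a hypothetical name, and the choice to consider substring lengths from the shortest entity length minus τ up to the longest plus τ is an assumption about what "valid substrings" means.

```python
def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def naive_extract(document, dictionary, tau):
    # Assumption: a substring is "valid" if its length could match some entity.
    lengths = [len(e) for e in dictionary]
    shortest = max(1, min(lengths) - tau)
    longest = max(lengths) + tau
    results = []
    for start in range(len(document)):
        for length in range(shortest, longest + 1):
            end = start + length
            if end > len(document):
                break
            segment = document[start:end]  # the query segment
            for entity in dictionary:      # the similarity selection query
                if edit_distance(segment, entity) <= tau:
                    results.append((start, segment, entity))
    return results

print(naive_extract("the al-qaeda cell", ["al-qaida"], 1))
```

Every query segment is compared against every entity, which is what the partitioning and indexing techniques in the following slides are meant to avoid.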

• If s and s’ are within edit distance τ and s is split into kτ partitions, there is at least one partition with at most one edit error.
• Select kτ = ⌈(τ + 1)/2⌉.
Example: s = [abcdefghijkl], s’ = [axxbcdefghxijkl], τ = 3, kτ = 2
• s = [abcdef], [ghijkl]
• s’ = [axxbcde], [fghxijkl]
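A minimal sketch of the partitioning step follows. It assumes kτ is rounded up when τ is even and that the string is split as evenly as possible, with any leftover characters going to the earlier partitions; `num_partitions` and `partition` are illustrative names.

```python
import math

def num_partitions(tau):
    # With k partitions and at most tau errors, some partition has at most
    # one error as soon as 2 * k > tau.
    return math.ceil((tau + 1) / 2)

def partition(s, tau):
    # Returns (start offset, text) for each of the k partitions.
    k = num_partitions(tau)
    base, extra = divmod(len(s), k)
    parts, start = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        parts.append((start, s[start:start + length]))
        start += length
    return parts

print(partition("abcdefghijkl", 3))
```

For the slide's example this yields the two partitions [abcdef] and [ghijkl].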

• Shifting the first partition of s by 2 gives [cdefgh]; scaling the result by −1 gives [cdefg].
Transformation rules:
• First partition: we only need to consider scaling within the range [−2, 2].
• Last partition: we only need to consider the combination of the same amount of shifting and scaling within the range [−τ, τ] (so that the last character is always included in the resulting substring).
• Remaining partitions: we need to consider shifting within the range [−τ, τ] and scaling within the range [−2, 2].
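The three rules can be sketched as follows. The sketch assumes that shifting moves the start of a partition while keeping its length, that scaling changes the length by moving the end, and that for the last partition a shift of d is paired with a scaling of −d so the end stays fixed. Variations that would fall outside the string are dropped. `apply_move` and `transform_partition` are hypothetical names.

```python
def apply_move(s, start, length, shift, scale):
    # Shift moves the start (length kept); scale then changes the length.
    begin = start + shift
    end = begin + length + scale
    if 0 <= begin < end <= len(s):
        return s[begin:end]
    return None  # variation falls outside the string

def transform_partition(s, start, length, tau, position):
    # position is "first", "last", or "middle".
    if position == "first":
        moves = [(0, c) for c in range(-2, 3)]
    elif position == "last":
        # Assumption: shift d with scale -d keeps the last character included.
        moves = [(d, -d) for d in range(-tau, tau + 1)]
    else:
        moves = [(d, c) for d in range(-tau, tau + 1) for c in range(-2, 3)]
    variations = set()
    for shift, scale in moves:
        v = apply_move(s, start, length, shift, scale)
        if v is not None:
            variations.add(v)
    return variations

s = "abcdefghijkl"
print(apply_move(s, 0, 6, 2, -1))                       # shift by 2, scale by -1
print(sorted(transform_partition(s, 0, 6, 3, "first")))
print(sorted(transform_partition(s, 6, 6, 3, "last")))
```

With τ = 3 the first partition yields 5 variations and the last yields 2τ + 1 = 7, all ending in the final character of s.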

• First partition: 5 variations
• Intermediate partitions: 5(2τ + 1) variations each
• Last partition: 2τ + 1 variations
• Total number of 1-variants generated = O(mτ²)

• s = [abcdef], [ghijkl]
• s’ = [axxbcde], [fghxijkl]
• The second partition of the segment s’, [fghxijkl], has a 1-variant match with the partition variation [fghijkl], which is generated from the second partition of s.
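The match in this example can be reproduced with a small variant generator. The sketch assumes that the k-variant family of a string consists of the string itself plus every string obtained by deleting up to k characters (a deletion-neighborhood reading; the slides do not spell out the definition). `gen_variants` mirrors the GenVariants name used in the algorithms, but the body is an illustration only.

```python
def gen_variants(s, k):
    # Assumption: variants are formed by deleting at most k characters.
    family = {s}
    frontier = {s}
    for _ in range(k):
        frontier = {t[:i] + t[i + 1:] for t in frontier for i in range(len(t))}
        family |= frontier
    return family

query_side = gen_variants("fghxijkl", 1)   # second partition of s'
entity_side = gen_variants("fghijkl", 1)   # partition variation of s
print(sorted(query_side & entity_side))
```

The two families intersect because deleting the inserted x from [fghxijkl] gives [fghijkl], so an index lookup on the variants of the query side finds the entity.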

• If a partition (or partition variation) is longer than a prefix length lp, we only use its lp-prefix to generate its 1-variants.
• Assume lp is set to 3. Then 1-variants are generated only from the 3-character prefixes of the partition variations.
• By setting lp ≤ m/kτ − 2, the total number of 1-variants generated is further reduced to O(lp τ²).
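To illustrate the effect of prefix pruning, the sketch below compares the number of 1-variants generated from a whole partition variation with the number generated from its lp-prefix. It again assumes deletion-based 1-variants; `one_variants` and `pruned_one_variants` are illustrative names.

```python
def one_variants(s):
    # The string itself plus every single-character deletion (assumed definition).
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}

def pruned_one_variants(variation, lp):
    # Only the lp-prefix of the variation is used to generate 1-variants.
    return one_variants(variation[:lp])

full = one_variants("fghijkl")
pruned = pruned_one_variants("fghijkl", 3)
print(len(full), len(pruned))
print(sorted(pruned))
```

The number of variants per variation now depends on lp rather than on the length of the partition.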

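• Short and long entities in the dictionary are indexed separately and stored in two inverted indexes, Ishort and Ilong.
• For each entity whose length is smaller than kτ·lp + τ, the τ-variant family of its lp-prefix is generated and indexed in Ishort.
• For each entity whose length is at least kτ·lp, the lp-prefix of each partition variation is used to generate its 1-variant family, which is indexed in Ilong.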

Algorithm: BuildIndex(E, τ, lp)
  for each e ∈ E do
    if |e| < kτ·lp + τ then
      V ← GenVariants(e[1 .. min(lp, |e|)], τ);
      /* GenVariants(s, k) generates the k-variant family of string s */
      for each v ∈ V do
        Ishort[v] ← Ishort[v] ∪ {e};
    if |e| ≥ kτ·lp then
      P ← the set of kτ partitions of e;
      for each i-th partition p ∈ P do
        PT ← TransformPartition(p);
        /* according to the three transformation rules in Section 3.1 */
        for each partition variation pT ∈ PT do
          V ← GenVariants(pT[1 .. lp], 1);
          for each v ∈ V do
            Ilong[v] ← Ilong[v] ∪ {(e, i)};
  return (Ishort, Ilong)
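A runnable sketch of BuildIndex under several assumptions: variants are deletion-based, partitions are split evenly, shifting keeps the length while scaling moves the end, the last partition pairs a shift of d with a scaling of −d, and Ilong stores (entity, partition id) pairs. The helper bodies are illustrative reconstructions and not the authors' code.

```python
import math
from collections import defaultdict

def gen_variants(s, k):
    # Assumption: k-variant family = s plus all deletions of up to k characters.
    family, frontier = {s}, {s}
    for _ in range(k):
        frontier = {t[:i] + t[i + 1:] for t in frontier for i in range(len(t))}
        family |= frontier
    return family

def partition_bounds(length, k):
    # Even split into k (start, size) pairs; leftovers go to earlier partitions.
    base, extra = divmod(length, k)
    bounds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        bounds.append((start, size))
        start += size
    return bounds

def transform_partition(e, start, size, tau, pid, k):
    if pid == 0:                      # first partition: scaling only
        moves = [(0, c) for c in range(-2, 3)]
    elif pid == k - 1:                # last partition: end stays fixed
        moves = [(d, -d) for d in range(-tau, tau + 1)]
    else:                             # intermediate partitions
        moves = [(d, c) for d in range(-tau, tau + 1) for c in range(-2, 3)]
    out = set()
    for shift, scale in moves:
        begin = start + shift
        end = begin + size + scale
        if 0 <= begin < end <= len(e):
            out.add(e[begin:end])
    return out

def build_index(entities, tau, lp):
    k = math.ceil((tau + 1) / 2)
    ishort, ilong = defaultdict(set), defaultdict(set)
    for e in entities:
        if len(e) < k * lp + tau:     # short entity: tau-variants of its prefix
            for v in gen_variants(e[:lp], tau):
                ishort[v].add(e)
        if len(e) >= k * lp:          # long entity: 1-variants per variation
            for pid, (start, size) in enumerate(partition_bounds(len(e), k)):
                for pt in transform_partition(e, start, size, tau, pid, k):
                    for v in gen_variants(pt[:lp], 1):
                        ilong[v].add((e, pid))
    return ishort, ilong

ishort, ilong = build_index(["abcdefghijkl", "abc"], tau=3, lp=3)
print(sorted(ilong["fgh"]))
```

Probing Ilong with a 1-variant of the query prefix "fgh" (from [fghxijkl]) returns the entity together with the partition that produced the match.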

Algorithm: MatchDocument(D, E, τ)
  for each starting position p ∈ [1, |D| − Lmin + τ + 1] do
    SearchLong(D[p .. p + lp − 1], E, τ);   /* matching entities no shorter than kτ·lp */
    SearchShort(D[p .. p + lp − 1], E, τ);  /* matching entities of length in [Lmin, kτ·lp) */

Algorithm: SearchLong(s, E, τ)
  R ← ∅;  /* holds results */
  C ← ∅;  /* holds candidates */
  V ← GenVariants(s, 1);  /* generate the 1-variant family */
  for each v ∈ V do
    for each (e, pid) ∈ Ilong[v] do
      C ← C ∪ {(e, pid)};  /* duplicates removed */
  for each (e, pid) ∈ C do
    S ← QuerySegmentInstantiation(e, pid);
    /* returns the set of query segment candidates for e */
    for each seg ∈ S do
      if Verify(seg, e) = true then
        R ← R ∪ {(seg, e)};
  return R

SearchShort(s)
• We need to generate the τ-variant families for each possible length l between Lmin − τ and lp.
• If the current query segment is shorter than lp, every candidate pair formed by probing the index needs to be verified.
• Otherwise, we need to perform verification for 2τ + 1 possible query segments.

• For example, enumerate the 1-variants of the string [abcdef] from left to right.
• Suppose no variant in the index starts with abc.
• The algorithm still enumerates the other three 1-variants that start with abc (abcef, abcdf, abcde).
• To avoid this, introduce a parameter lpp, set to lp/2.

Consider 4 possible cases:

Prefix match   Suffix match   Action
True           True           enumerate all 1-variants of q[1 .. lp]
True           False          enumerate all 1-variants of q[(lpp + 1) .. lp]
False          True           enumerate all 1-variants of q[1 .. lpp]
False          False          discard q, as there is no match
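The case analysis can be written as a small decision function. The mapping used here is an assumption about how the four cases read: when only the prefix half matches, only the second half is varied; when only the suffix half matches, only the first half is varied; when neither matches, the query is discarded. `enumeration_plan` is a hypothetical name, and how the prefix and suffix match flags are obtained from the index is left out.

```python
def enumeration_plan(q, lp, prefix_match, suffix_match):
    # lpp splits the lp-prefix of the query into two halves.
    lpp = lp // 2
    if prefix_match and suffix_match:
        return ("enumerate", q[:lp])       # vary the whole prefix q[1..lp]
    if prefix_match:
        return ("enumerate", q[lpp:lp])    # vary only q[(lpp+1)..lp]
    if suffix_match:
        return ("enumerate", q[:lpp])      # vary only q[1..lpp]
    return ("discard", "")                 # no variant can match

for flags in [(True, True), (True, False), (False, True), (False, False)]:
    print(flags, enumeration_plan("abcdef", 6, *flags))
```

In the abc example above, a failed prefix match means none of the variants that keep abc intact need to be generated.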

• Successfully reduced the size of the neighborhood
• Proposed an efficient query processing algorithm
• Optimized the algorithm to share computation
• Avoided unnecessary variant enumeration
