# Efficient Approximate Entity Extraction with Edit Distance Constraints Wei Wang 1, Chuan Xiao 1, Xuemin Lin 1 and Chengqi Zhang 2 1 University of New South.


Efficient Approximate Entity Extraction with Edit Distance Constraints

Wei Wang¹, Chuan Xiao¹, Xuemin Lin¹ and Chengqi Zhang²

¹ University of New South Wales and NICTA
² University of Technology, Sydney

2 Named Entity Recognition

- Dictionary-based NER
- Dictionary of Entities: Isaac Newton, Sigmund Freud, English, Austrian, physicist, mathematician, astronomer, philosopher, alchemist, theologian, psychiatrist, economist, historian, sociologist, ...
- Documents
  1. Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics.
  2. Sigmund Freud was an Austrian psychiatrist who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalyst.

3 Approximate Entity Extraction

- What if data are not cleaned or standardized, due to typos, multiple representations, etc.?
- Example, multiple representations that should match the same entity: al qaeda, al qaida, al-qaeda, al-qa'ida
- Using similarity measures
  - token-based measures, e.g., Jaccard: x = {al, qaeda}, y = {al, qaida}, J(x, y) = 1/3 ≈ 0.33
  - If we set the threshold to 0.33, it works well for entities with several tokens, but {al, qaeda} will also match {al, gore}!
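The Jaccard figures on this slide can be reproduced with a few lines of Python. This is a minimal sketch; the function name `jaccard` is chosen here for illustration and tokens are assumed to be whitespace-separated words.

```python
def jaccard(x, y):
    """Jaccard similarity of two token collections: |x ∩ y| / |x ∪ y|."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # convention chosen here for two empty sets
    return len(x & y) / len(x | y)


# The intended match and the false match get the same score,
# so a threshold of 0.33 cannot tell them apart.
print(jaccard("al qaeda".split(), "al qaida".split()))
print(jaccard("al qaeda".split(), "al gore".split()))
```

Both calls share exactly one token out of three distinct tokens, which is why token-based measures are too coarse for short entities.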

4 Using Edit Distance Constraints

- Using string-based measures: edit distance
- Problem Definition: Given a document R and a dictionary E of entities, the task of approximate entity extraction with edit distance threshold d is to find all sub-strings in R such that they are within edit distance d from one of the entities in E:

  $\{\, \langle R[i..j], E_k \rangle \mid ed(R[i..j], E_k) \le d \,\}$
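To make the problem definition concrete, the sketch below computes edit distance with the standard dynamic program and then solves the extraction task by brute force. It is a baseline for illustration only, not the algorithm presented in these slides; `edit_distance` and `naive_extract` are names introduced here.

```python
def edit_distance(s, t):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]


def naive_extract(doc, entities, d):
    """Return all (i, j, entity) with ed(doc[i:j], entity) <= d."""
    results = []
    for e in entities:
        # Length filtering: only substrings within d of the entity length.
        for n in range(max(1, len(e) - d), len(e) + d + 1):
            for i in range(len(doc) - n + 1):
                if edit_distance(doc[i:i + n], e) <= d:
                    results.append((i, i + n, e))
    return results


print(naive_extract("al qaida", ["qaeda"], 1))
```

The brute-force version verifies every substring against every entity, which is exactly the cost the filtering and indexing techniques on the following slides try to avoid.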

5 Previous Approaches

- q-gram based method
  - count filtering: at least LB(s,t) common q-grams, where $LB(s,t) = \max(|s|, |t|) - q + 1 - q \cdot d$
  - position filtering: positions of common q-grams should be within d
  - length filtering: $|\,len(s) - len(t)\,| \le d$
- Steps
  - index the q-grams for the entities
  - probe the index for the q-grams of each sub-string (query) of the document → form candidates
  - verify the candidates
- Example, q = 3: Rhode_Island has the q-grams `Rho`, `hod`, `ode`, `de_`, `e_I`, `_Is`, `Isl`, `sla`, `lan`, `and`; at most q*d q-grams are destroyed
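The count filter can be sketched as follows. The helper names (`qgrams`, `count_lower_bound`, `passes_count_filter`) are hypothetical, and the sketch covers only count and length filtering, not position filtering or the inverted index.

```python
from collections import Counter


def qgrams(s, q):
    """All overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_lower_bound(s, t, q, d):
    """LB(s, t) = max(|s|, |t|) - q + 1 - q*d."""
    return max(len(s), len(t)) - q + 1 - q * d


def passes_count_filter(s, t, q, d):
    """True if s and t survive length and count filtering."""
    if abs(len(s) - len(t)) > d:
        return False
    common = sum((Counter(qgrams(s, q)) & Counter(qgrams(t, q))).values())
    return common >= count_lower_bound(s, t, q, d)


print(qgrams("Rhode_Island", 3))
print(count_lower_bound("Rhode_Island", "Rhode_Island", 3, 1))
print(passes_count_filter("Rhode_Island", "Rhode_Isand", 3, 1))
```

A pair that passes is only a candidate; it still has to be verified with an actual edit distance computation.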

6 Drawbacks of q-gram Based Methods

- Entities are short: we have to use a small q to ensure the lower bound of matching q-grams is positive
- Short q-grams result in poor performance
  - short q-grams are frequent → long inverted lists
  - the lower bound is low for short entities → large candidate size
- It has to try all the queries with length from $L_{\min} - d$ to $L_{\max} + d$ at every starting position.
- Document
  1. Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics.
- Dictionary ($L_{\min} = 9$, $L_{\max} = 43$)
  1. physicist
  2. mathematician
  3. Philosophiæ Naturalis Principia Mathematica
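Both drawbacks can be quantified with simple arithmetic. The sketch below uses hypothetical helper names and assumes the count-filtering bound |s| - q + 1 - q*d for two strings of equal length |s|.

```python
def count_lower_bound(entity_len, q, d):
    """Count-filtering bound for two strings of the same length."""
    return entity_len - q + 1 - q * d


def num_queries(doc_len, l_min, l_max, d):
    """Number of substrings with length in [l_min - d, l_max + d]."""
    total = 0
    for n in range(max(1, l_min - d), l_max + d + 1):
        total += max(0, doc_len - n + 1)
    return total


# For the shortest entity on the slide (length 9) and q = 3, the bound
# is already non-positive at d = 3, so count filtering prunes nothing.
for d in (1, 2, 3):
    print(d, count_lower_bound(9, 3, d))

# Queries to probe in a 1000-character document for this dictionary, d = 2.
print(num_queries(1000, 9, 43, 2))
```

The document length of 1000 is an arbitrary illustrative value, not a figure from the slides.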

7 FastSS Algorithm [T. Bocek et al. 2007]

- Basic Idea: Neighborhood Generation
  - generate the variants for each entity and query by enumerating edit operations at any possible position
- Steps
  - enumerate at most d deletions for each entity; the resulting strings are called the d-variant family and are inserted into an inverted index
  - generate the d-variant family for each query, probe the index to form candidates, and then verify them
- Example, d = 1
  - e = qaeda, q = qaida
  - V_e = {qaeda, aeda, qeda, qada, qaea, qaed}
  - V_q = {qaida, aida, qida, qada, qaia, qaid}
- Problem
  - the size of the d-variant family for each entity (query) is $O(|s|^d)$
  - too many variants when entities are long or d is large!
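The deletion-only neighborhood in the example can be sketched as below. This is a minimal illustration of the idea as described on the slide, not the FastSS implementation; the names `deletion_variants`, `build_index` and `candidates` are introduced here.

```python
from collections import defaultdict


def deletion_variants(s, d):
    """All strings obtainable from s by deleting at most d characters."""
    variants = {s}
    frontier = {s}
    for _ in range(d):
        frontier = {v[:i] + v[i + 1:] for v in frontier for i in range(len(v))}
        variants |= frontier
    return variants


def build_index(entities, d):
    """Inverted index: variant -> set of entities that produce it."""
    index = defaultdict(set)
    for e in entities:
        for v in deletion_variants(e, d):
            index[v].add(e)
    return index


def candidates(query, index, d):
    """Entities sharing at least one variant with the query."""
    found = set()
    for v in deletion_variants(query, d):
        found |= index.get(v, set())
    return found


index = build_index(["qaeda"], 1)
print(sorted(deletion_variants("qaeda", 1)))
print(candidates("qaida", index, 1))  # matched through the shared variant "qada"
```

Sharing a variant only makes a pair a candidate, so each candidate still needs an edit distance check before it is reported.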

8 Partitioning Scheme

- How to reduce the number of variants?
  - immediate solution: divide an entity (query) into several partitions and generate d-variants within each partition only → guaranteed not to miss any result
- Still too many variants?
  - pigeon-hole principle: if we consider shifting and scaling, there exists an entity partition and a query partition such that their edit distance is within 1 → generate the 1-variant family for each partition
  - divide each entity (query) into $k = \lceil (d+1)/2 \rceil$ partitions
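The choice of k can be motivated by a short counting argument. The derivation below is a sketch of the pigeon-hole step only; it assumes the edit operations can be attributed to partitions and leaves out the boundary alignment that shifting and scaling handle.

```latex
\begin{align*}
&\text{Suppose } ed(e, q) \le d \text{ and the edit operations fall into }
  k = \left\lceil \tfrac{d+1}{2} \right\rceil \text{ partitions.} \\
&\text{If every partition contained at least 2 edit operations, then} \\
&\qquad \#\text{operations} \;\ge\; 2k
  \;=\; 2\left\lceil \tfrac{d+1}{2} \right\rceil \;\ge\; d + 1 \;>\; d, \\
&\text{a contradiction. Hence some partition contains at most 1 edit operation.} \\
&\text{Examples: } d = 1 \Rightarrow k = 1, \quad
  d = 2 \Rightarrow k = 2, \quad d = 3 \Rightarrow k = 2.
\end{align*}
```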

9 Partitioning Scheme

- divide each entity (query) into $k = \lceil (d+1)/2 \rceil$ partitions
- shift within the range of [-d, d]
- scale within the range of [-2, 2] (it can be proved that 2 is enough)
- shifting and scaling are only needed on entities
- special cases
  - first partition (always starts from the first character): only need to consider scaling within [-2, 2]
  - last partition (always ends with the last character): only need to consider the same amount of shifting and scaling within [-d, d]

10 Partitioning Scheme - Example

- Example, d = 3: e = abcdefgh, q = axxbcdefgyh
- Partitioning, k = 2: P e = { ;, ; ; ; ; ; ; ; ; ; } P q = { ; }
- Generating 1-variants: V{defgh} and V{defgyh} share a common variant 'defgh', so this candidate will be identified
- represented in the form of
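The shared variant in this example can be checked directly. The sketch below assumes that 1-variants are produced by deleting at most one character, as on the FastSS slide; `one_variants` is a name introduced here.

```python
def one_variants(s):
    """The string itself plus every string with one character deleted."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


entity_partition = "defgh"   # a partition of e = abcdefgh
query_partition = "defgyh"   # a partition of q = axxbcdefgyh

common = one_variants(entity_partition) & one_variants(query_partition)
print(common)
```

Deleting `y` from the query partition yields the unmodified entity partition, which is the common variant named on the slide.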

11 Prefix Pruning

- What if a partition is still quite long?
  - still many 1-variants
  - solution: generate the 1-variant family on the prefix only!
- Prefix Pruning: if a partition is longer than a threshold l, we only generate the 1-variant family on its l-prefix.
- Example, l = 5
  - P = abcdefg
  - generate the 1-variant family on its 5-prefix → P[1..5] = abcde
  - V_P[1..5] = {abcde, bcde, acde, abde, abce, abcd}
- Space complexity (# of variants generated)
  - FastSS: $O(|s|^d)$
  - after partitioning and prefix pruning: $O(l \cdot d^2)$
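Prefix pruning is a small change on top of variant generation. In the sketch below, which assumes deletion-based 1-variants, the helper names are hypothetical and `l` is the prefix length threshold from the slide.

```python
def one_variants(s):
    """The string itself plus every string with one character deleted."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}


def prefix_pruned_variants(partition, l):
    """1-variant family of the l-prefix if the partition is longer than l."""
    prefix = partition[:l] if len(partition) > l else partition
    return one_variants(prefix)


print(sorted(prefix_pruned_variants("abcdefg", 5)))
```

With single deletions the 5-prefix `abcde` yields six strings, so the number of variants per partition is bounded by l + 1 regardless of how long the partition is.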

12 NGPP Algorithm

- Neighborhood Generation + Partitioning + Prefix Pruning
- Balance between variant size and selectivity: different schemes to deal with short and long entities
- Index short and long entities
  - short: for entities which are shorter than k*l+d, we index the d-variant family on the l-prefix (prefix pruning only)
  - long: for entities which are no shorter than k*l, we first divide them into k partitions, and index the 1-variant family on the l-prefix of the partitions (partitioning + prefix pruning)
- Scan documents
  - scan each starting position
  - enumerate the query length from $L_{\min} - d$ to l
  - generate its d-variant family, search for short entities
  - generate its 1-variant family, search for long entities

13 NGPP Example

- d = 2, l = 4 → short = 8
- Entity
  - e1 = 'Providence' (long)
  - e2 = 'capital' (short)
- Document: Prowidnce is the kaepital of Rhode Island.
- Index
  - e1: generate the 1-variant family for pr, pro, prov, provi, provid and vidence, idence, dence, ence, nce
  - e2: generate the d-variant family for capital
- Scan
  - queries Prow, rowi, owid, … → e1 Providence (1-variant match)
  - query kaep, … → e2 capital (d-variant match)
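The short/long split in this example follows from k and l. The sketch below assumes the boundary is k*l, which matches "short = 8" on this slide; the function names are introduced here for illustration.

```python
import math


def num_partitions(d):
    """k = ceil((d + 1) / 2)."""
    return math.ceil((d + 1) / 2)


def classify(entity, d, l):
    """Label an entity 'long' if it has at least k*l characters (assumed boundary)."""
    return "long" if len(entity) >= num_partitions(d) * l else "short"


d, l = 2, 4
print(num_partitions(d) * l)
for entity in ("Providence", "capital"):
    print(entity, classify(entity, d, l))
```

`Providence` has 10 characters and `capital` has 7, so they land on opposite sides of the boundary and are indexed with different schemes.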

14 Experiment Settings

- Algorithms: NGPP, FastSS, q-gram based method
- Measure: number of variants, candidate size, running time
- Dataset

| dataset | # of records | avg. string length |
|---|---|---|
| DBLP DICT (author) | 108k | 14.5 |
| DBLP DOC (author, title) | 87k | 104.7 |
| GENE DICT (gene/protein name) | 381k | 22.4 |
| GENE DOC (author, title, abstract) | 10k | 870.0 |
| CONLL DICT (person, location) | 8k | 12.6 |
| CONLL DOC (news article) | 19k | 819.0 |

15 Experiment Results

- NGPP vs FastSS (DBLP; d = 2)

| algorithm | # of variants | candidate size | running time |
|---|---|---|---|
| FastSS | 7500M | 2.1M | 2643s |
| NGPP (l = 10) | 150M | 11M | 40s |

16 Experiment Results

- NGPP vs q-gram based method (DBLP; d = 1, 2, 3)
- [Figures: Candidate Size; Running Time]

17 Conclusion

- Contributions
  - an efficient algorithm for approximate entity extraction with edit distance constraints, based on neighborhood generation
  - two techniques to reduce the number of variants generated, as well as running time: partitioning and prefix pruning
- Future work
  - approximate multiple pattern matching
  - other similarity measures, e.g., the function used in DNA/protein sequence alignment

18 Thank you! Questions?

19 Related Work

- Neighborhood generation approaches
  - E. W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12(4/5):345–374, 1994.
  - T. Bocek, E. Hunt, B. Stiller. Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich, April 2007.
- q-gram based approaches
  - L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.
  - C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933–944, 2008.
- Alternative: use vgrams instead of q-grams
  - C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007.
  - X. Yang, B. Wang, and C. Li. Cost-based variable length gram selection for string collections to support approximate queries efficiently. In SIGMOD, 2008.
