
1 VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
  VLDB 2007
  Chen Li (UC Irvine), Bin Wang (Northeastern University), Xiaochun Yang (Northeastern University)
  Presented by Jae-won Lee

2 Introduction
  - Many applications have an increasing need to support approximate string queries on data collections.
  - Examples of approximate string queries:
    - Data cleaning: the same entity can be represented in slightly different forms, e.g. "PO BOX 23" and "P.O. Box 23".
    - Query relaxation: errors in the query, inconsistencies in the data, or limited knowledge about the data, e.g. "Steven Spielburg" and "Steve Spielberg".
    - Spellchecking: find potential candidates for a possibly mistyped word.
  Copyright © 2006 by CEBT. IDS Lab. Seminar, Center for E-Business Technology.

3 Introduction
  - Dilemma of choosing the gram length
    - The gram length can greatly affect the performance of string matching.
    - Increasing the gram length:
      - makes the inverted lists shorter, which may decrease the time to merge them;
      - lowers the threshold on the number of common grams, which makes the filter less selective.
  Strings:
    id  string
    0   rich
    1   stick
    2   stich
    3   stuck
    4   static
  2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc (# of common grams >= 3)
  3-grams: ati, ich, ick, ric, sta, sti, stu, tat, tic, tuc, uck (# of common grams >= 1)
  [Figure: inverted lists of string ids for each 2-gram and each 3-gram]
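The trade-off on this slide can be checked with a small sketch. It builds fixed-length q-gram inverted lists for the five strings and evaluates the count-filter bound (|s| - q + 1) - k * q that appears later in the deck. The function names are illustrative, not from the paper, and the query length 6 with k = 1 is taken from the "shtick" example on slide 13.

```python
from collections import defaultdict

def qgrams(s, q):
    """Return the positional q-grams of s as (1-based position, gram) pairs."""
    return [(i + 1, s[i:i + q]) for i in range(len(s) - q + 1)]

def build_inverted_index(strings, q):
    """Map each q-gram to the sorted ids of the strings that contain it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for _, gram in qgrams(s, q):
            index[gram].add(sid)
    return {gram: sorted(ids) for gram, ids in index.items()}

def common_gram_lower_bound(length, q, k):
    """Count-filter bound (length - q + 1) - k * q for edit distance k."""
    return (length - q + 1) - k * q

STRINGS = ["rich", "stick", "stich", "stuck", "static"]
index2 = build_inverted_index(STRINGS, 2)
index3 = build_inverted_index(STRINGS, 3)

# Longer grams give shorter lists but a weaker (smaller) threshold.
total2 = sum(len(ids) for ids in index2.values())
total3 = sum(len(ids) for ids in index3.values())
bound2 = common_gram_lower_bound(6, 2, 1)
bound3 = common_gram_lower_bound(6, 3, 1)
```

With these strings the 3-gram lists hold fewer ids in total than the 2-gram lists, while the required number of common grams drops from 3 to 1, which is the dilemma the slide describes.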

4 VGRAM: Main Idea
  - We analyze the frequencies of variable-length grams in the strings and select a set of grams, called the gram dictionary.
  - For a string, we generate a set of grams of variable lengths using the gram dictionary.
  - Challenges
    - How to generate variable-length grams?
    - How to construct a high-quality gram dictionary?
    - What is the relationship between the similarity of two strings and the similarity of their gram sets?
    - How to adopt VGRAM in existing algorithms?

5 Challenge 1: Generating Variable-Length Grams
  - Example
    - String s = universal, D = {ni, ivr, sal, uni, vers}, q_min = 2, q_max = 4
    - Start at position p = 1 with VG = {}.
    - The longest substring starting at u that appears in D is uni, so the algorithm inserts (1, uni).
    - Move to the next character, n: the longest substring is ni. This candidate (2, ni) is subsumed by the previous one, so the algorithm does not insert it into VG.
    - Move to the next character, i: no substring starting at this character matches a gram in D, so the algorithm produces (3, iv), of length q_min = 2.
    - Final set: VG(s) = {(1, uni), (3, iv), (4, vers), (7, sal)}
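The walk-through above can be sketched as a short function. This is a minimal reading of the steps on the slide, assuming the dictionary is a plain Python set (the paper stores it in a trie) and that a candidate is "subsumed" when an already selected gram covers its whole character range. The name `generate_vgrams` is illustrative.

```python
def generate_vgrams(s, dictionary, q_min, q_max):
    """Return the positional variable-length grams of s as (1-based position, gram)."""
    selected = []
    # Positions from which a gram of length q_min still fits.
    for p in range(1, len(s) - q_min + 2):
        start = p - 1
        gram = None
        # Try the longest candidate first, down to q_min.
        for length in range(min(q_max, len(s) - start), q_min - 1, -1):
            candidate = s[start:start + length]
            if candidate in dictionary:
                gram = candidate
                break
        if gram is None:
            # No dictionary gram starts here: fall back to the q_min-gram.
            gram = s[start:start + q_min]
        end = p + len(gram) - 1
        # Skip the candidate if an earlier gram covers its whole range.
        subsumed = any(p0 <= p and p0 + len(g0) - 1 >= end for p0, g0 in selected)
        if not subsumed:
            selected.append((p, gram))
    return selected

D = {"ni", "ivr", "sal", "uni", "vers"}
vg = generate_vgrams("universal", D, q_min=2, q_max=4)
```

On the slide's input this yields the same four positional grams as the final set VG(s) listed above.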

6 Challenge 2: Constructing Gram Dictionary
  - Step 1: Collect the frequencies of grams with length in [q_min = 2, q_max = 4]
  [Figure: gram trie with leaf nodes; entries shown: st → 0, 1, 3; sti → 0, 1; stu → 3; stic → 0, 1; stuc → 3]
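Step 1 can be illustrated with a flat counter. This is a sketch only: the deck stores frequencies in a trie, whereas the code below uses `collections.Counter` for brevity, and it reuses the five example strings from slide 3 as an assumed input.

```python
from collections import Counter

def collect_gram_frequencies(strings, q_min, q_max):
    """Count every occurrence of every gram whose length is in [q_min, q_max]."""
    freq = Counter()
    for s in strings:
        for q in range(q_min, q_max + 1):
            for i in range(len(s) - q + 1):
                freq[s[i:i + q]] += 1
    return freq

freq = collect_gram_frequencies(
    ["rich", "stick", "stich", "stuck", "static"], q_min=2, q_max=4
)
```

A trie gives the same counts while sharing prefixes, so the frequency of a gram and of its extensions (for example st, sti, stic) sit on one root-to-leaf path, which is what the pruning step relies on.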

7 Challenge 2: Constructing Gram Dictionary
  - Step 2: Selecting high-quality grams
    - If a gram g has a low frequency, eliminate from the tree all the extended grams of g.
    - If a gram is very frequent, keep some of its extended grams.

8 Challenge 2: Constructing Gram Dictionary
  - Pruning the tree using a frequency threshold T = 2
    - Case: the frequency of a node (which has a leaf node) is ≤ T
  [Figure: pruned trie with the removed nodes marked]

9 Challenge 2: Constructing Gram Dictionary
  - Pruning the tree using a frequency threshold T = 2
    - Case: the frequency of a node (which has a leaf node) is > T
    - A pruning policy selects a maximal subset of children to remove:
      - SmallFirst: choose children with the smallest frequencies
      - LargeFirst: choose children with the largest frequencies
      - Random: randomly choose children so that L.freq is not greater than T
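The three policies can be sketched as a greedy selection. The slide does not spell out the exact procedure, so the following is an assumption: children are visited in policy order, and a child is removed only if adding its frequency to the leaf node's frequency L.freq keeps the total at or below T. The function name and argument layout are hypothetical.

```python
import random

def choose_children_to_remove(child_freqs, leaf_freq, threshold, policy, rng=None):
    """Pick child labels to remove so that leaf_freq plus their frequencies stays <= threshold.

    child_freqs maps a child label to its frequency.
    """
    labels = list(child_freqs)
    if policy == "SmallFirst":
        labels.sort(key=lambda c: child_freqs[c])
    elif policy == "LargeFirst":
        labels.sort(key=lambda c: child_freqs[c], reverse=True)
    elif policy == "Random":
        (rng or random).shuffle(labels)
    else:
        raise ValueError(f"unknown policy: {policy}")

    removed, total = [], leaf_freq
    for label in labels:
        # Greedy step: take the child only if the threshold still holds.
        if total + child_freqs[label] <= threshold:
            removed.append(label)
            total += child_freqs[label]
    return removed

children = {"a": 1, "b": 2, "c": 3}
small = choose_children_to_remove(children, leaf_freq=0, threshold=3, policy="SmallFirst")
large = choose_children_to_remove(children, leaf_freq=0, threshold=3, policy="LargeFirst")
rand = choose_children_to_remove(children, 0, 3, "Random", rng=random.Random(0))
```

In this toy input SmallFirst removes two low-frequency children while LargeFirst removes the single largest one, so the policies leave different grams in the dictionary even under the same threshold.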

10 Challenge 3: Similarity of Gram Sets
  - Analyzing the effect of an edit operation on the positional grams
    - These effects are stored in the NAG vector (the vector of the number of affected grams).
    - For an edit operation on the i-th character and a positional gram (p, g):
      - Category 1: p < i - q_max + 1 or p + |g| - 1 > i + q_max - 1
      - Category 2: p ≤ i ≤ p + |g| - 1
      - Category 3: (p, g) lies to the left of the i-th character
      - Category 4: (p, g) lies to the right of the i-th character
  [Figure: string s with a deletion at position i; the window runs from i - q_max + 1 to i + q_max - 1; regions from left to right: Category 1, Category 3, Category 2, Category 4, Category 1]

11 Challenge 3: Similarity of Gram Sets
  - Example
    - s = universal, D = {ni, ivr, sal, uni, vers}, q_min = 2, q_max = 4
    - VG(s) = {(1, uni), (3, iv), (4, vers), (7, sal)}
    - Consider a deletion of the 5th character, e, in the string s: i - q_max + 1 = 2 and i + q_max - 1 = 8.
    - Positional grams (1, uni) and (7, sal) are in Category 1: the starting position is before 2, or the ending position is after 8. These grams are not affected by the deletion.
    - (4, vers) is in Category 2.
    - (3, iv) is in Category 3. Since D contains an extension of iv (ivr), (3, iv) could be affected by the deletion (potentially affected).
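The category test in this example reduces to a few position comparisons. The sketch below follows the definitions used in the example (window from i - q_max + 1 to i + q_max - 1); the function name is illustrative, and it only classifies grams, without deciding whether a Category 3 or 4 gram is actually affected.

```python
def classify_gram(p, gram, i, q_max):
    """Return the category (1-4) of positional gram (p, gram) for an edit at position i."""
    end = p + len(gram) - 1
    if p < i - q_max + 1 or end > i + q_max - 1:
        return 1  # outside the window: not affected
    if p <= i <= end:
        return 2  # contains the edited character
    if end < i:
        return 3  # entirely to the left of position i
    return 4      # entirely to the right of position i

VG = [(1, "uni"), (3, "iv"), (4, "vers"), (7, "sal")]
categories = {(p, g): classify_gram(p, g, i=5, q_max=4) for p, g in VG}
```

For the deletion at i = 5 this reproduces the slide: (1, uni) and (7, sal) fall in Category 1, (4, vers) in Category 2, and (3, iv) in Category 3.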

12 Challenge 3: Similarity of Gram Sets
  - # of grams affected by each operation
    - We want to transform string s into string s' with 2 edit operations.
    - At most 4 grams can be affected.
  [Figure: the string _u_n_i_v_e_r_s_a_l_, with the number of affected grams for a deletion/substitution at each character and for an insertion at each gap (_); table of # of edit operations versus # of grams for string s']

13 Challenge 4: Adopting VGRAM Technique
  - Example of an algorithm based on inverted lists
    - Query: EditDistance(shtick, ?) ≤ 1
    - VG(q) = {(1, sh), (2, ht), (3, tick)}, extracted using the gram dictionary
    - Fixed-length 2-grams: # of common grams = (|s_1| - q + 1) - k * q = (6 - 2 + 1) - 1 * 2 = 3
    - Variable-length 2-4 grams: # of common grams = |VG(q)| - NAG(q, k) = 3 - 2 = 1
  [Figure: the strings rich (0), stick (1), stich (2), stuck (3), static (4) with their inverted lists, once for 2-grams (… ck, ic, … ti …) and once for 2-4 grams (… ck, ic, ich, … tic, tick …)]
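The counting step behind this example can be sketched as follows. The code merges the inverted lists of the query's grams by counting how often each string id appears, then keeps ids that reach the lower bound. The 2-gram lists are derived from the five example strings; the function names are illustrative, and the NAG value 2 is simply the number quoted on the slide, not computed here.

```python
from collections import Counter

def count_filter_candidates(query_grams, index, lower_bound):
    """Return ids of strings sharing at least lower_bound grams with the query."""
    counts = Counter()
    for gram in query_grams:
        for sid in index.get(gram, []):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= lower_bound)

def vgram_lower_bound(num_query_grams, nag):
    """VGRAM bound |VG(q)| - NAG(q, k)."""
    return num_query_grams - nag

# 2-gram inverted lists for rich(0), stick(1), stich(2), stuck(3), static(4),
# restricted to the grams that the query "shtick" can hit.
INDEX_2GRAMS = {"ck": [1, 3], "ic": [0, 1, 2, 4], "ti": [1, 2, 4]}
query_2grams = ["sh", "ht", "ti", "ic", "ck"]

fixed_bound = (6 - 2 + 1) - 1 * 2          # = 3, as on the slide
candidates = count_filter_candidates(query_2grams, INDEX_2GRAMS, fixed_bound)
vgram_bound = vgram_lower_bound(3, 2)      # |VG(q)| = 3, NAG(q, 1) = 2
```

With fixed 2-grams only string 1 (stick) survives the filter. An existing algorithm adopts VGRAM by running the same merge over variable-length gram lists and swapping in the |VG(q)| - NAG(q, k) bound.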

14 Experiments
  - Data sets
    - Data set 1: Texas Real Estate Commission. 151K person names, average length = 33.
    - Data set 2: English dictionary from the Aspell spellchecker for Cygwin. 149,165 words, average length = 8.
    - Data set 3: DBLP Bibliography. 277K titles, average length = 62.

15 VGRAM Overhead
  - Data set 3
  [Figures: Index Size; Construction Time]

16 Benefits of Using Variable-Length Grams
  - Data set 1
  [Figures: Construction Time/Size; Query Time]

17 Effect of q_max
  - Data set 1
  [Figures: Construction Time / Query Time; Query Performance]

18 Effect of Frequency Threshold
  - Data set 1
  [Figures: Construction Time; Index Size; Query Time]

19 Conclusion
  - We developed VGRAM to improve the performance of approximate string queries.
    - Variable-length grams, high-quality grams
  - We gave a full specification of the technique:
    - Index structure
    - How to generate grams for a string using the index structure
    - Relationship between the similarity of two strings and the similarity of their grams
  - We showed how to adopt this technique in a variety of existing algorithms.

