VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams Chen Li Bin Wang and Xiaochun Yang Northeastern University,

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams Chen Li Bin Wang and Xiaochun Yang Northeastern University, China

2 Approximate string selections Keanu Reeves Samuel Jackson Schwarzenegger Samuel Jackson … Schwarrzenger Query errors: Limited knowledge about data Typos Limited input device (cell phone) input Data errors Typos Web data OCR Applications Spellchecking Query relaxation …

3 Approximate string joins R informix microsoft … … S infromix … mcrosoft … Record linkage Edit distance Jaccard Cosine …

4 Goal Reducing index size (memory) Reducing running time

5 “ q-grams ” of strings u n i v e r s a l 2-grams

6 q-gram inverted lists id strings 0123401234 rich stick stich stuck static 4 23 0 1 4 2-grams at ch ck ic ri st ta ti tu uc 2 0 13 0124 4 124 3 3

7 # of common grams >= 3 Searching using inverted lists Query: “ shtick ”, ED(shtick, ?)≤1 id strings 0123401234 rich stick stich stuck static 2-grams at ch ck ic ri st ta ti tu uc 4 23 0 1 4 2 0 13 0124 4 124 3 3 ti iccksh ht ti ic ck

8 # of common grams >= 1 2-grams -> 3-grams? Query: “ shtick ”, ED(shtick, ?)≤1 id strings 0123401234 rich stick stich stuck static 3-grams ati ich ick ric sta sti stu tat tic tuc uck tic icksht hti tic ick 4 2 4 1 2 0 1 0 3 4 1 3 4 2 3 id strings 0123401234 rich stick stich stuck static id strings 0123401234 rich stick stich stuck static

9 Outline Motivation VGRAM Main idea Decomposing strings to grams Choosing good grams Effect of edit operations on grams Adopting vgram in existing algorithms Experiments

10 Observation 1: dilemma of choosing “q” Increasing “q” causing: Longer grams  Shorter lists Smaller # of common grams of similar strings id strings 0123401234 rich stick stich stuck static 4 23 0 1 4 2-grams at ch ck ic ri st ta ti tu uc 2 0 13 0124 4 124 3 3

11 Observation 2: skew distributions of gram frequencies DBLP: 276,699 article titles Popular 5-grams: ation (>114K times), tions, ystem, catio

12 VGRAM: Main idea Grams with variable lengths (between q min and q max ) zebra ze(123) corrasion co(5213), cor(859), corr(171) Advantages Reduce index size Reducing running time Adoptable by many algorithms

13 Challenges Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their gram-set similarity? Adopting VGRAM in existing algorithms?

14 Challenge 1: String  Variable-length grams? Fixed-length 2-grams Variable-length grams u n i v e r s a l ni ivr sal uni vers [2,4]-gram dictionary u n i v e r s a l

15 Representing gram dictionary as a trie ni ivr sal uni vers

16 Challenge 2: Constructing gram dictionary st  0, 1, 3 sti  0, 1 stu  3 stic  0, 1 stuc  3 Step 1: Collecting frequencies of grams with length in [qmin, qmax] Gram trie with frequencies

17 Step 2: selecting grams Pruning trie using a frequency threshold T (e.g., 2)

18 Step 2: selecting grams (cont) Threshold T = 2

19 Final gram dictionary [2,4]-grams

20 Outline Motivation VGRAM Main idea Decomposing strings to grams Choosing good grams  Effect of edit operations on grams Adopting vgram in existing algorithms Experiments

21 Challenge 3: Edit operation’s effect on grams k operations could affect k * q grams u n i v e r s a l Fixed length: q

22 Deletion affects variable-length grams i-q max +1i+q max - 1 Deletion Not affected Affected i

23 Grams affected by a deletion ni ivr sal uni vers [2,4]-grams u n i v e r s a l i-q max +1i+q max - 1 Deletion i Affected? Deletion Affected?

24 Grams affected by a deletion (cont) i-q max +1i+q max - 1 Deletion i Affected? Trie of gramsTrie of reversed grams

25 # of grams affected by each operation _ u _ n _ i _ v _ e _ r _ s _ a _ l _ 01 11 12 12 22 11 12 11 1 10 Deletion/substitutionInsertion

26 Max # of grams affected by k operations Vector of s = With 2 edit operations, at most 4 grams can be affected Called NAG vector (# of affected grams) Precomputed and stored

27 Summary of VGRAM index

28 Challenge 4: adopting VGRAM Easily adoptable by many algorithms Basic interfaces: String s  grams String s1, s2 such that ed(s1,s2) <= k  min # of their common grams

29 Lower bound on # of common grams If ed(s1,s2) =: (|s 1 |- q + 1) – k * q u n i v e r s a l Fixed length (q) Variable lengths: # of grams of s1 – NAG(s1,k)

30 Example: algorithm using inverted lists Query: “shtick”, ED(shtick, ?)≤1 124 12 1 0 4 3 … ck ic … ti … Lower bound = 3 Lower bound = 1 sh ht tick 24 1 4 0 1 1 2 3 … ck ic ich … tic tick … 2-4 grams2-grams tick id strings 0123401234 rich stick stich stuck static id strings 0123401234 rich stick stich stuck static id strings 0123401234 rich stick stich stuck static

31 PartEnum + VGRAM PartEnum, fixed q-grams: ed(s1,s2) <= k  hamming(grams(s1),grams(s2)) <= k * q VGRAM: ed(s1,s2) <= k  hamming(VG (s1),VG(s2)) <= NAG(s1,k) + NAG(s2,k)

32 PartEnum + VGRAM (naïve) Bm(R) = max(NAG(r,k)) R Bm(S) = max(NAG(s,k)) S Both are using the same gram dictionary. Use Bm(R) + Bm(S) as the new hamming bound.

33 PartEnum + VGRAM (optimization) R Bm(S) = max(NAG(s,k)) S Group R based on the NAG(r,k) values Join(R1,S) using Bm(R1) + Bm(S) Similarly, Join(R2,S), Join(R3,S) Local bounds tighter  better signatures generated Grouping S also possible. R1 with Bm(R1) R2 with Bm(R2) R3 with Bm(R3)

34 Outline Motivation VGRAM Main idea Decomposing strings to grams Choosing good grams Effect of edit operations on grams Adopting vgram in existing algorithms Experiments

35 Data sets Data set 1: Texas Real Estate Commission. 151K person names, average length = 33. Data set 2: English dictionary from the Aspell spellchecker for Cygwin. 149,165 words, average length = 8. Data set 3: DBLP Bibliography. 277K titles, average length = 62.

36 VGRAM overhead (index size) Dataset 3: DBLP titles

37 VGRAM overhead (construction time) Dataset 3: DBLP titles

38 Benefits over fixed-length grams (index) Dataset 1: Person names

39 Benefits over fixed-length grams (running time) Dataset 1: Person names

40 Effect of q max Dataset 1: Person names

41 Effect of frequency threshold T Dataset 1: Person name

42 Improving algorithm ProbeCount Dataset 1: Person name

43 Improving algorithm ProbeCluster Dataset 1: Person name

44 Improving algorithm PartEnum Dataset 1: Person name

45 Discussions Dynamic maintenance Edit distance variants Approximate substring queries Block moves Using VGRAM in DBMS

46 Conclusions VGRAM: using grams of variable-length high-quality Adoptable in existing algorithms Reduce index size Reduce running time

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams Chen Li Bin Wang and Xiaochun Yang Northeastern University,

Similar presentations

Presentation on theme: "VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams Chen Li Bin Wang and Xiaochun Yang Northeastern University,"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams Chen Li Bin Wang and Xiaochun Yang Northeastern University,

Similar presentations

Presentation on theme: "VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams Chen Li Bin Wang and Xiaochun Yang Northeastern University,"— Presentation transcript:

Similar presentations

About project

Feedback