Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently Xiaochun Yang, Bin Wang Chen Li Northeastern.


1 Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, Chen Li. Northeastern University, China

2 Approximate selection queries
Example data: Keanu Reeves, Samuel Jackson, Schwarzenegger, Samuel Jackson, … Example query: Schwarrzenger
Query errors: limited knowledge about data; typos; limited input device (cell phone) input
Data errors: typos; Web data; OCR
Applications: spellchecking; query relaxation; …
Similarity functions: edit distance; Jaccard; cosine; …

3 Performance is a big issue. Answer queries interactively. Many queries on a server: 5 ms/query gives 200 queries/second; 20 ms/query gives 50 queries/second.

4 Outline: Motivation; Tightening lower bound of common strings; Effects of adding a gram on index and queries; Cost-based construction of gram dictionary; Experiments

5 q-grams. Example string: bingon. Its 2-grams: bi, in, ng, go, on.

6 q-gram inverted lists (2-grams)
String collection D0:
  id  string
  1   bingo
  2   bioinng
  3   bitingin
  4   biting
  5   boing
  6   going
Inverted lists:
  gram  string ids
  bi    1,2,3,4
  bo    5
  gi    3
  go    1,6
  in    1,2,3,3,4,5,6
  io    2
  it    3,4
  ng    1,2,3,4,5,6
  nn    2
  oi    2,5,6
  ti    3,4
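The inverted lists on this slide can be reproduced with a few lines of code. The following is a minimal sketch, not the authors' implementation; the function names `qgrams` and `build_inverted_lists` are made up for illustration. It assumes no padding characters are added to the strings and that a string id is repeated when a gram occurs more than once, which is what the slide's list for "in" suggests (id 3 appears twice).

```python
from collections import defaultdict

def qgrams(s, q=2):
    """Return the q-grams of s in order of position (no padding assumed)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_inverted_lists(strings, q=2):
    """Map each q-gram to the ids (1-based) of the strings containing it.

    An id is appended once per occurrence, so a gram that appears twice
    in a string contributes that string's id twice.
    """
    lists = defaultdict(list)
    for sid, s in enumerate(strings, start=1):
        for gram in qgrams(s, q):
            lists[gram].append(sid)
    return dict(lists)

strings = ["bingo", "bioinng", "bitingin", "biting", "boing", "going"]
index = build_inverted_lists(strings)
for gram in sorted(index):
    print(gram, index[gram])
```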

7 Query processing (2-grams). Same string collection D0 and inverted lists as on slide 6. Query: ED(bingon, ?) ≤ 1. Required # of common grams >= 3.
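The query on this slide can be traced with a small count filter. This is a sketch under assumptions, not the authors' algorithm: the names `search` and `edit_distance` are illustrative, the threshold is the fixed-length bound (number of query grams minus k * q), and candidates are verified with a plain dynamic-programming edit distance.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def edit_distance(a, b):
    """Standard Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def search(strings, query, k, q=2):
    """Return (threshold, candidate ids, answer ids) for ED(query, ?) <= k."""
    lists = defaultdict(list)
    for sid, s in enumerate(strings, start=1):
        for gram in qgrams(s, q):
            lists[gram].append(sid)

    query_grams = qgrams(query, q)
    threshold = len(query_grams) - k * q
    if threshold <= 0:
        # The filter cannot prune anything; every string is a candidate.
        candidates = list(range(1, len(strings) + 1))
    else:
        counts = Counter()
        for gram in query_grams:
            counts.update(lists.get(gram, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)

    answers = [sid for sid in candidates
               if edit_distance(strings[sid - 1], query) <= k]
    return threshold, candidates, answers

strings = ["bingo", "bioinng", "bitingin", "biting", "boing", "going"]
threshold, candidates, answers = search(strings, "bingon", k=1)
print(threshold, candidates, answers)
```

With the slide's data, string 5 (boing) shares only two grams with the query and is pruned; the remaining candidates still have to be verified.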

8 VGRAM: variable-length grams [VLDB07]. Example string: bingon. [2,3]-gram dictionary: bi, bin, bo, gi, go, in, ing, io, it, ng, nn, oi, ti
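To make the idea of variable-length grams concrete, here is a hedged sketch of one way to decompose a string with a gram dictionary. It assumes the following procedure, which is my reading of the VGRAM approach rather than something stated on the slide: at each position take the longest dictionary gram starting there (falling back to the qmin-gram), and skip a gram that lies entirely inside one already produced. The function name `vgrams` is illustrative.

```python
def vgrams(s, dictionary, qmin, qmax):
    """Decompose s into variable-length grams; returns (position, gram) pairs.

    Assumed procedure: longest dictionary match at each start position,
    default to the qmin-gram, drop grams subsumed by an earlier gram.
    """
    grams = []
    max_end = 0  # exclusive end of the furthest-reaching gram so far
    for p in range(len(s) - qmin + 1):
        chosen = s[p:p + qmin]  # fallback when nothing in the dictionary matches
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                chosen = s[p:p + length]
                break
        end = p + len(chosen)
        # Starts only increase, so a gram is subsumed exactly when it
        # does not reach past the furthest end seen so far.
        if end > max_end:
            grams.append((p, chosen))
            max_end = end
    return grams

dictionary = {"bi", "bin", "bo", "gi", "go", "in", "ing",
              "io", "it", "ng", "nn", "oi", "ti"}
print(vgrams("bingon", dictionary, qmin=2, qmax=3))
```

Under these assumptions "bingon" yields four grams instead of the five fixed-length 2-grams, because "ng" is contained in "ing" and is dropped.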

9 Adopting VGRAM in algorithms. VGRAM: gram dictionary; string to grams; lower bound. Example string: bingon. [Figure: gram dictionary drawn as a tree with nodes n1 to n33.] # of common grams >= 3

10 Contributions of this study
Tightening lower bounds using dynamic programming
Cost-based quantitative approach: analyze and estimate query performance when adding each gram; automatically find high-quality grams
[Figure: string collection, gram dictionary, high-quality gram]

11 Outline: Motivation; Tightening lower bound of common strings; Effects of adding a gram on index and queries; Cost-based construction of gram dictionary; Experiments

12 Calculating lower bound, fixed length (q). If ed(s1,s2) <= k, then # of common grams >= # of s1 grams – k * q. Example string: b i i n d i n g
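The fixed-length bound is simple arithmetic, sketched below for the slide's example string. The function name `fixed_length_lower_bound` is illustrative, and the sketch assumes a string of length n has n - q + 1 grams (no padding).

```python
def fixed_length_lower_bound(s, k, q):
    """Lower bound on common q-grams between s and any string within edit distance k.

    Each edit operation can destroy at most q grams, so at least
    (# of grams of s) - k * q grams survive. A value <= 0 means the
    bound gives no pruning power.
    """
    num_grams = max(len(s) - q + 1, 0)
    return num_grams - k * q

for k in (1, 2):
    print(k, fixed_length_lower_bound("biinding", k, q=2))
```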

13 Calculating lower bound, variable lengths. String: b i i n d i n g. Per-position values (number of grams affected by an edit operation at that position): 1 2 3 2 3 2 1 1. lower bound = # of grams of s1 – NAG(s1,k)

14 Too pessimistic? k-Max: summation of the k largest values. String: b i i n d i n g, per-position values 1 2 3 2 3 2 1 1, so NAG(s,2) = 3 + 3 = 6
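The k-Max estimate can be written in one line. A minimal sketch (the name `nag_k_max` is illustrative), using the per-position values from the slide:

```python
import heapq

def nag_k_max(values, k):
    """k-Max estimate of NAG(s, k): the sum of the k largest per-position values."""
    return sum(heapq.nlargest(k, values))

values = [1, 2, 3, 2, 3, 2, 1, 1]  # per-position values for "biinding" from the slide
print(nag_k_max(values, 2))
```

The estimate adds the two largest values independently, so a gram affected by both chosen positions is counted twice. That is why the slide asks whether it is too pessimistic.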

15 Tightening lower bound. Dynamic programming: tightening NAG(s,k). Subproblems: NAG(s[1, j], i). [Figure: string s, positions 1 to j, i edit operations]

16 Dynamic programming: recurrence function. [Figure: string s, positions 1 to j, B[j]; one case uses i operations, the other i-1 operations]

17 Dynamic programming. String: b i i n d i n g, per-position values 1 2 3 2 3 2 1 1
NAG vector:
  k = 0: 0 0 0 0 0 0 0 0 0
  k = 1: 0 1 2 3 3 3 3 3 3
  k = 2: 0 1 2 3 4 5 5 5 5
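The NAG vector shows the tightened value NAG(s,2) = 5, one less than the k-Max estimate of 6. The sketch below illustrates why such a gap can arise. It is an exhaustive search, not the dynamic program from the slides, and the gram spans in `GRAM_SPANS` are hypothetical: they were chosen only so that the number of grams covering each position matches the slide's values 1 2 3 2 3 2 1 1. It also assumes an edit affects exactly the grams covering its position, which is a simplification.

```python
from itertools import combinations

# HYPOTHETICAL gram spans (1-based, inclusive positions), not taken from the
# slides. They reproduce the per-position counts 1 2 3 2 3 2 1 1.
GRAM_SPANS = [(1, 3), (2, 3), (3, 5), (4, 5), (5, 6), (6, 8)]

def affected(pos):
    """Indices of the grams covering position pos (simplified effect of one edit)."""
    return {g for g, (lo, hi) in enumerate(GRAM_SPANS) if lo <= pos <= hi}

def nag_exhaustive(j, k):
    """Max number of distinct grams hit by at most k edits within positions 1..j."""
    best = 0
    for r in range(1, min(k, j) + 1):
        for chosen in combinations(range(1, j + 1), r):
            hit = set().union(*(affected(p) for p in chosen))
            best = max(best, len(hit))
    return best

per_position = [len(affected(p)) for p in range(1, 9)]
for k in range(3):
    print(k, [nag_exhaustive(j, k) for j in range(9)])
```

With these assumed spans, the two positions with value 3 share one gram, so two edits can affect at most 5 distinct grams rather than 3 + 3 = 6.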

18 Outline: Motivation; Tightening lower bound of common strings; Effects of adding a gram on index and queries; Cost-based construction of gram dictionary; Experiments

19 Effects on inverted lists. Gram dictionary {ab, bc}; add gram abc; new gram dictionary {ab, bc, abc}. Example strings: --abc-- and --ab----bc--
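The effect of adding the gram abc can be illustrated with two made-up strings, one containing "abc" and one containing "ab" and "bc" apart. This is a sketch under assumptions: the strings "xabcy" and "abxbc", the limits qmin=2 and qmax=3, and the decomposition rule (longest dictionary match per position, qmin-gram fallback, subsumed grams dropped) are all illustrative choices, not taken from the slide.

```python
from collections import defaultdict

def vgrams(s, dictionary, qmin, qmax):
    """Variable-length grams of s under the assumed decomposition rule."""
    grams, max_end = [], 0
    for p in range(len(s) - qmin + 1):
        chosen = s[p:p + qmin]  # fallback qmin-gram
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                chosen = s[p:p + length]
                break
        if p + len(chosen) > max_end:  # keep only grams not subsumed
            grams.append(chosen)
            max_end = p + len(chosen)
    return grams

def build_lists(strings, dictionary, qmin=2, qmax=3):
    """Inverted lists from gram to 1-based string ids."""
    lists = defaultdict(list)
    for sid, s in enumerate(strings, start=1):
        for gram in vgrams(s, dictionary, qmin, qmax):
            lists[gram].append(sid)
    return dict(lists)

strings = ["xabcy", "abxbc"]  # hypothetical: "--abc--" and "--ab----bc--"
before = build_lists(strings, {"ab", "bc"})
after = build_lists(strings, {"ab", "bc", "abc"})
print("before:", before)
print("after: ", after)
```

In this example the string containing "abc" leaves the lists of ab and bc and moves to the new list of abc, while the string with "ab" and "bc" apart stays where it was.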

20 Effects on query performance: decrease query's inverted lists; change lower bound; change # of candidates

21 Effects on query's inverted lists. Gram dictionary {ab, bc}; add gram abc; new gram dictionary {ab, bc, abc}. Adding a new gram abc either leaves the query's inverted lists unchanged or shrinks them. Example queries Q: one without ab, one with ab (-----ab-----), one with abc (-----abc-----)

22 Effects on lower bound. Query Q: ----abcd-----. Query: Q, ED(Q, ?) ≤ 1

23 Effects on # of candidates. A change in the lower bound changes the # of candidates. Query Q: ----abcd----. Gram dictionary {ab, bc}; add gram abc; new gram dictionary {ab, bc, abc}

24 Outline: Motivation; Tightening lower bound of common strings; Effects of adding a gram on index and queries; Cost-based construction of gram dictionary (current section); Experiments

25 Construct a gram dictionary [VLDB07]: qmin = 2, qmax = 4

26 Cost-based construction: qmin = 2

27 Outline: Motivation; Tightening lower bound of common strings; Effects of adding a gram on index and queries; Cost-based construction of gram dictionary; Experiments (current section)

28 Data sets
Environment: GNU C++, Dell GX620 PC with an Intel Pentium 2.40GHz Dual Core CPU, 2GB memory, 250GB disk, Ubuntu (Linux) O.S. Index structures were assumed to be in memory.
  Data set        String #    Length (Min / Max / Avg)   Range of # of injected edit operations
  Article Titles  277,000     6 / 207 / 66               [1,6]
  Movie Titles    855,000     8 / 249 / 35               [1,3]
  Actor Names     1,200,000   4 / 74 / 17                [1,2]

29 Effect of Tightening Lower Bound. Dataset: 1M actor names. Construct gram dictionary: 100,000 sample strings, 5000 queries, qmin = 4

30 Comparison with algorithm Prune [VLDB07]. Dataset: 1M article titles. Prune: qmin=5, qmax=7, T=2000, LargeFirst policy. GramGen: 1% sampling ratio, 2000 queries (qmin=5 automatically determined)

31 Choosing qmin. Construct gram dictionary: (a) 3,000 queries, (b) sample ratio = 2%

32 Conclusions
Tightening lower bound: dynamic programming
Analysis of how adding a gram affects: index structure; performance of queries
Efficient algorithm: automatically generating a high-quality gram dictionary

33 Thank you. Questions or comments?

34 Related work
Approximate string matching: q-grams, q-samples
Inside DBMS: substring matching; set similarity join
Estimation: selectivity of SQL LIKE substring queries; approximate string answers

