1 Efficient Merging and Filtering Algorithms for Approximate String Searches
Jiaheng Lu, University of California, Irvine
Joint work with Chen Li and Yiming Lu

2 Example: a movie database
Star            Title           Year  Genre
Keanu Reeves    The Matrix      1999  Sci-Fi
Samuel Jackson  Iron Man        2008  Sci-Fi
Schwarzenegger  The Terminator  1984  Sci-Fi
Samuel Jackson  The Man         2006  Crime
Find movies starring Schwarrzenger.

3 In general: Gap between Queries and Data
Errors in the query:
The user doesn't remember a string exactly
The user unintentionally types a wrong string
…
Query: Schwarrzenger
Data: Schwarzenegger

4 Data may not be clean
Errors in the database: data is often not clean by itself, which is especially true in data integration and cleansing.
Relation R (Star): Keanu Reeves, Samuel Jackson, Schwarzenegger, Samuel Jackson
Relation S (Star): Keanu Reeves, Samuel L. Jackson, Schwarzenegger, Samuel L. Jackson

5 Query may include errors

6 Problem definition: approximate string searches
Query q
Collection of strings s (Star): Keanu Reeves, Samuel Jackson, Schwarzenegger, Samuel Jackson, …
Search
Output: strings s that satisfy Sim(q,s) ≤ δ

7 Example Similarity Function: Edit Distance
A widely used metric to define string similarity
ed(s1,s2) = minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2
Example: s1: Tom Hanks, s2: Ton Hank, ed(s1,s2) = 2
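The definition above can be checked with a short sketch. This is the standard dynamic-programming computation of edit distance, not code from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the current prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, as on the slide
```

The two operations in the slide's example are the substitution of "m" by "n" and the deletion of the final "s".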

8 Example: approximate string searches
Query q: Tom Hanks
Collection of strings s (Star): Ton Hank, Tom Hank, Thomas Hanks, Tom J. Hanks, …
Search
Output: strings s that satisfy ed(q,s) ≤ 2

9 Outline
Problem motivation
Preliminary: Grams, Inverted lists
Merge algorithms
Filtering technique
Conclusion

10 String Grams
q-grams
For example, the 2-grams of "universal": (un),(ni),(iv),(ve),(er),(rs),(sa),(al)
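A minimal sketch of gram extraction, assuming plain overlapping substrings with no padding characters (as in the slide's example); the helper name `qgrams` is hypothetical.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """All overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 such grams, so "universal" (9 characters) yields 8 2-grams.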

11 Inverted lists
Convert strings to gram inverted lists
Strings (id, string): 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
Grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc
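One way to build these lists is sketched below. It assumes string ids are assigned in input order starting at 0 and that each id appears at most once per list; `build_inverted_lists` is a name introduced here, not one from the talk.

```python
from collections import defaultdict


def build_inverted_lists(strings, q=2):
    """Map each q-gram to the increasing list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set keeps each id at most once per list, even if a gram repeats in s.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)
    return dict(index)


index = build_inverted_lists(["rich", "stick", "stich", "stuck", "static"])
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are appended in increasing order, every list comes out sorted, which is the property the merge algorithms rely on.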

12 Main Example
Data (id, string): 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
Inverted lists (gram: string ids):
ck: 1,3
ic: 0,1,2,4
st: 1,2,3,4
ta: 4
ti: 1,2,4
…
Query: stick, with grams (st,ti,ic,ck) and ed(s,q) ≤ 1
Merge the lists of st (1,2,3,4), ti (1,2,4), ic (0,1,2,4) and ck (1,3), keeping ids with count >= 2. Performance bottleneck!
Candidate string ids: {1,2,3,4}
Double check for the real edit distance
Final answers: {1,2,3}
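The pipeline on this slide (merge the query's lists, keep ids that occur often enough, then verify) can be sketched end to end. The slide only states "count >= 2"; the formula used below, T = (number of query grams) - k·q, is the usual count-filter bound and is an assumption here, as are all function names.

```python
from collections import Counter


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def approximate_search(query, strings, k=1, q=2):
    """Return (candidate ids, final answer ids) for ed(s, query) <= k."""
    index = {}
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index.setdefault(gram, []).append(sid)

    query_grams = qgrams(query, q)
    threshold = len(query_grams) - k * q  # assumed count-filter bound
    if threshold <= 0:
        # The count filter cannot prune anything; verify every string.
        candidates = list(range(len(strings)))
    else:
        counts = Counter()
        for gram in query_grams:
            counts.update(index.get(gram, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)

    # Double check the real edit distance on the surviving candidates.
    answers = [sid for sid in candidates
               if edit_distance(query, strings[sid]) <= k]
    return candidates, answers


strings = ["rich", "stick", "stich", "stuck", "static"]
print(approximate_search("stick", strings, k=1))
# ([1, 2, 3, 4], [1, 2, 3])
```

For "stick" the bound gives 4 - 1·2 = 2, matching the slide. The string "static" passes the count filter but is rejected in verification.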

13 Sub-problem definition: Given multiple inverted lists with integer values in increasing order and a threshold T, find all values whose number of occurrences is ≥ T.

14 Example
Count threshold: 4
Result:

15 Outline
Problem motivation
Preliminary
Merge algorithms: Two previous algorithms, Our proposed three algorithms
Filtering technique
Conclusion

16 Five Merge Algorithms
Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]
New: ScanCount, MergeSkip, DivideSkip

17 Two previous algorithms (1): Heap-based Algorithm
Min-heap: push the frontier elements of the lists to a heap and count the number of occurrences of each element.

18 Example of HeapMerger [Sarawagi et al 2004]
minHeap
Count threshold
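The slide's input lists did not survive in this transcript, so the sketch below runs the heap-based merge on the lists from the earlier "stick" example instead. It is a simplified reading of the heap-based idea, assuming every list is strictly increasing; `heap_merge` is a hypothetical name.

```python
import heapq


def heap_merge(lists, threshold):
    """Return values occurring in at least `threshold` of the sorted lists."""
    # Heap entries: (current value, list index, position within that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every list whose frontier equals the smallest value and advance it.
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= threshold:
            result.append(value)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]  # st, ti, ic, ck
print(heap_merge(lists, 2))  # [1, 2, 3, 4]
print(heap_merge(lists, 4))  # [1]
```

Every element of every list passes through the heap, which is the cost the later algorithms try to avoid.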

19 Five Merge Algorithms
Previous: HeapMerger [Sarawagi 2004], MergeOpt [Sarawagi 2004]
New: ScanCount, MergeSkip, DivideSkip

20 Two previous algorithms (2): MergeOpt Algorithm
Long lists: the T-1 longest lists, probed with binary search
Short lists: the remaining lists

21 Example of MergeOpt [Sarawagi et al 2004]
Count threshold: 4
Long lists: 3
Short lists: 2
Min-heap
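A simplified sketch of the long/short split follows. It assumes the short lists are merged with a heap (here via the standard library's `heapq.merge`) and each surviving value is then looked up in the T-1 long lists by binary search. It omits any further optimizations of the original algorithm, and `merge_opt` is a hypothetical name.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def merge_opt(lists, threshold):
    """Values occurring in >= threshold lists, using T-1 long lists for lookups."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists = by_len[:threshold - 1]
    short_lists = by_len[threshold - 1:]
    result = []
    # Any answer must occur in at least one short list, because only
    # T-1 lists were set aside as long lists.
    for value, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)
        for lst in long_lists:
            pos = bisect_left(lst, value)
            if pos < len(lst) and lst[pos] == value:
                count += 1
        if count >= threshold:
            result.append(value)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]
print(merge_opt(lists, 4))  # [1]
```

With threshold 4 only the shortest list `[1, 3]` is scanned; the other three are touched only through binary search.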

22 Can we run faster?

23 Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip

24 Our new algorithms (1): ScanCount Algorithm
Use an array to record the number of occurrences of each element

25 ScanCount Example
Count threshold: 4
Result: 13
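A minimal sketch of the counting idea, assuming string ids are integers in the range 0 to `num_ids` - 1 and each list contains an id at most once. The parameter `num_ids` and the sample lists are made up for illustration; the slide's own lists are not in the transcript.

```python
def scan_count(lists, threshold, num_ids):
    """Scan every list once, counting occurrences of each id in an array."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            # Each id crosses the threshold exactly once, so it is reported once.
            if counts[sid] == threshold:
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [1, 2, 3, 13, 20]]
print(scan_count(lists, threshold=4, num_ids=21))  # [13]
```

There is no heap and no comparison between lists, only array increments, at the price of an array as large as the id space.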

26 Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip

27 Our new algorithms (2): MergeSkip algorithm
Min-heap
Pop T-1 elements, then jump

28 Example of MergeSkip
Count threshold: 4
minHeap

29 Example of MergeSkip
Count threshold: 4
minHeap

30 Example of MergeSkip
Count threshold
minHeap
Pop 1, 5, 10

31 Example of MergeSkip
Count threshold
minHeap
Pop 1, 5, 10
Jump 13

32 Example of MergeSkip
Count threshold: 4
minHeap: 13
Result:
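The pop-and-jump step in these slides can be sketched as follows. This is one reading of the slides rather than the authors' implementation: when the smallest value occurs fewer than T times, pop until T-1 lists are off the heap, then move each popped list forward by binary search to the first element not smaller than the new heap top. The input lists are hypothetical, chosen so the run also pops 1, 5 and 10 and then jumps to 13.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, threshold):
    """Values occurring in >= threshold sorted lists, skipping hopeless elements."""
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    pos = [0] * len(lists)  # current position in each list
    result = []
    # With fewer than T active lists, no value can reach T occurrences.
    while len(heap) >= threshold:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= threshold:
            result.append(value)
            for i in popped:
                pos[i] += 1
        else:
            # Pop more lists until T-1 are off the heap in total.
            while len(popped) < threshold - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]
            # Values below `target` occur only in the popped lists, i.e. at
            # most T-1 times, so each popped list can jump straight to `target`.
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [1, 2, 3, 13, 20]]
print(merge_skip(lists, 4))  # [13]
```

On these lists the elements 2, 3 and 7 are never read: the jumps move past them without pushing them onto the heap.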

33 Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip

34 Our new algorithms (3): DivideSkip Algorithm
Long lists: dynamic size, probed with binary search
Short lists: merged with MergeSkip

35 Size of long lists
Long lists, Short lists
Binary search, MergeOpt
Cost:
How many lists are treated as long lists?

36 Size of long lists
Long lists, Short lists
Binary search, MergeSkip
Cost:
How many lists are treated as long lists?

37 Decide L value
A good balance in the tradeoff: number of long lists L = T / (μ log M + 1)
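Putting the last few slides together, a hedged sketch of DivideSkip might look like the following. The slide does not define M, the base of the logarithm, or a value for μ. This sketch assumes M is the length of the longest list, uses a base-2 logarithm, and picks an arbitrary default `mu=0.01`. It also clamps L to at most T-1 so the short lists always keep a threshold of at least 1. All names are hypothetical.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip_counts(lists, threshold):
    """MergeSkip that returns (value, count) pairs with count >= threshold."""
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    pos = [0] * len(lists)
    out = []
    while len(heap) >= threshold:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= threshold:
            out.append((value, len(popped)))
            for i in popped:
                pos[i] += 1
        else:
            while len(popped) < threshold - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])
        for i in popped:
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return out


def divide_skip(lists, threshold, mu=0.01):
    """Split into L long lists and the rest; MergeSkip the rest, probe the long ones."""
    lists = sorted((lst for lst in lists if lst), key=len, reverse=True)
    if len(lists) < threshold:
        return []
    longest = len(lists[0])  # assumed meaning of M
    num_long = int(threshold / (mu * math.log2(longest) + 1))
    num_long = max(0, min(num_long, threshold - 1))  # safeguard added here
    long_lists, short_lists = lists[:num_long], lists[num_long:]
    result = []
    # An answer occurs at most L times in the long lists, hence at least
    # T - L times in the short lists.
    for value, count in merge_skip_counts(short_lists, threshold - num_long):
        for lst in long_lists:
            p = bisect_left(lst, value)
            if p < len(lst) and lst[p] == value:
                count += 1
        if count >= threshold:
            result.append(value)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]
print(divide_skip(lists, 4))  # [1]
```

A larger μ makes binary searches look more expensive and so selects fewer long lists; with `mu=1.0` on the lists above, L becomes 1 for threshold 4, and for threshold 2 it drops to 0 so the call is plain MergeSkip.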

38 Empirical verification
Our formula for L achieves the best result among the options tested.

39 Experimental data sets
Three real data sets with various string lengths and data sizes: DBLP data, IMDB data, Google Web corpus

40 Performance (DBLP data)
Running time per query with various algorithms
DivideSkip is the best one

41 Number of elements read (DBLP data)
DivideSkip skips reading the most elements
DivideSkip is the best one

42 Outline
Problem motivation
Preliminary
Merge algorithms
Filtering technique: Length, positional filter [Gravano et al. VLDB 2001]; Filter tree
Conclusion and future work

43 Length Filtering
Ed(s,t) ≤ 2
s: length 19
t: length 10
Pruned by length only!
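The length filter rests on a simple fact: each edit operation changes the length by at most one, so two strings within edit distance k differ in length by at most k. A minimal sketch, with a hypothetical function name and sample strings:

```python
def length_filter(query, strings, k):
    """Keep only strings whose length is within k of the query's length."""
    return [s for s in strings if abs(len(s) - len(query)) <= k]


print(length_filter("stick", ["rich", "stick", "static", "statistics"], k=1))
# ['rich', 'stick', 'static']
```

In the slide's example the lengths are 19 and 10, a difference of 9, so the pair is rejected for a threshold of 2 without computing any edit distance.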

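44 Positional Filtering
Positional gram: a gram together with its position. For example, string abcd: {(ab,1),(bc,2),(cd,3)}
Ed(s,t) ≤ 2
s contains (ab,1) and t contains (ab,12): the positions are too far apart for these grams to match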
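A small sketch of positional grams and the position test, assuming 1-based positions as in the slide's example and the usual rule that two equal grams can correspond only if their positions differ by at most the edit-distance threshold k. Both helper names are hypothetical.

```python
def positional_qgrams(s, q=2):
    """q-grams paired with their 1-based start position."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def grams_can_match(g1, g2, k):
    """Same gram text, and positions no more than k apart."""
    (text1, pos1), (text2, pos2) = g1, g2
    return text1 == text2 and abs(pos1 - pos2) <= k


print(positional_qgrams("abcd"))                     # [('ab', 1), ('bc', 2), ('cd', 3)]
print(grams_can_match(("ab", 1), ("ab", 12), k=2))   # False
```

The pair (ab,1) and (ab,12) from the slide fails the test because the positions differ by 11.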

45 Filter tree
root
Length level: 1, 2, 3, …, n
Gram level: aa, ab, …, zy, zz
Position level: 1, 2, …, m
Inverted list
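One way to picture the filter tree is as nested dictionaries keyed by length, then gram, then position, with an inverted list at each leaf. The sketch below assumes that layout and 1-based positions. The lookup rule (visit lengths and positions within k of the query's) combines the length and positional filters and is an illustration, not the authors' code.

```python
def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> increasing list of string ids."""
    tree = {}
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            leaf = (tree.setdefault(len(s), {})
                        .setdefault(s[i:i + q], {})
                        .setdefault(i + 1, []))
            leaf.append(sid)
    return tree


def lookup(tree, gram, pos, query_len, k):
    """Ids of strings with `gram` near `pos` whose length is near `query_len`."""
    ids = set()
    for length in range(query_len - k, query_len + k + 1):
        by_pos = tree.get(length, {}).get(gram, {})
        for p in range(pos - k, pos + k + 1):
            ids.update(by_pos.get(p, []))
    return sorted(ids)


tree = build_filter_tree(["rich", "stick", "stich", "stuck", "static"])
print(lookup(tree, "ic", pos=3, query_len=5, k=1))  # [0, 1, 2]
```

For the query "stick" (length 5, gram "ic" at position 3), the string "static" is excluded because its "ic" sits at position 5, even though its length passes the length level.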

46 Surprising experimental results (DBLP)
Chart comparing Heap, MergeOpt, ScanCount, MergeSkip and DivideSkip under three settings: No filter, Length, Length+Pos
Use filters wisely: more filters may be bad!

47 Conclusion
Three new merge algorithms: we run faster
Surprising experimental results: use filters wisely, more filters may be bad!

48 Thank you!

49 Backup: related work
Approximate string matching [Navarro 2001]
Variable-length grams [Li et al 2007]
Fuzzy lookup in

50 Reference
1. [Arasu 2006] A. Arasu, V. Ganti and R. Kaushik, Efficient Exact Set-similarity Joins, in VLDB 2006
2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti and R. Motwani, Robust and Efficient Fuzzy Match for Online Data Cleaning, in SIGMOD 2003
3. [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava, Approximate string joins in a database (almost) for free, in VLDB 2001

51 Reference
4. [Li 2007] C. Li, B. Wang and X. Yang, VGRAM: Improving performance of approximate queries on string collections using variable-length grams, in VLDB 2007
5. [Navarro 2001] G. Navarro, A guided tour to approximate string matching, in ACM Computing Surveys 2001
6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, Efficient set joins on similarity predicates, in ACM SIGMOD 2004

