Published by Abigail Kerr. Modified over 3 years ago.

1
1 Efficient Merging and Filtering Algorithms for Approximate String Searches. Jiaheng Lu, University of California, Irvine. Joint work with Chen Li and Yiming Lu.

2
2 Example: a movie database
Star            Title           Year  Genre
Keanu Reeves    The Matrix      1999  Sci-Fi
Samuel Jackson  Iron man        2008  Sci-Fi
Schwarzenegger  The Terminator  1984  Sci-Fi
Samuel Jackson  The man         2006  Crime
Query: find movies starring Schwarrzenger.

3
3 In general: gap between queries and data. Errors in the query: the user doesn't remember a string exactly; the user unintentionally types a wrong string; … Query: Schwarrzenger. Data: Schwarzenegger.

4
4 Data may not be clean. Errors in the database: data often is not clean by itself, which is especially true in data integration and cleansing. Relation R, column Star: Keanu Reeves, Samuel Jackson, Schwarzenegger, Samuel Jackson. Relation S, column Star: Keanu Reeves, Samuel L. Jackson, Schwarzenegger, Samuel L. Jackson.

5
5 Query may include errors

6
6 Problem definition: approximate string searches. Input: a query q and a collection of strings s (for example the Star column: Keanu Reeves, Samuel Jackson, Schwarzenegger, Samuel Jackson, …). Search output: strings s that satisfy Sim(q,s) ≥ δ.

7
7 Example similarity function: edit distance. A widely used metric to define string similarity. ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2. Example: s1: Tom Hanks, s2: Ton Hank, ed(s1,s2) = 2.
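The definition above can be sketched with the textbook dynamic-programming recurrence. This is a minimal illustration, not the authors' implementation; the function name `edit_distance` is chosen here for the example.

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The two operations for the slide's example are the substitution m -> n and the deletion of the final s.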

8
8 Example: approximate string searches. Query q: Tom Hanks. Collection of strings s (Star column): Tom J. Hanks, Ton Hank, Thomas Hanks, Tom Hank, … Search output: strings s that satisfy ed(q,s) ≤ 2.

9
9 Outline: Problem motivation; Preliminary (grams, inverted lists); Merge algorithms; Filtering technique; Conclusion.

10
10 String grams (q-grams). For example, the 2-grams of "universal" are (un),(ni),(iv),(ve),(er),(rs),(sa),(al).
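Gram extraction is a sliding window of width q over the string. A minimal sketch, with the function name `qgrams` chosen for illustration:

```python
def qgrams(s, q=2):
    """Return the q-grams of s in left-to-right order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 grams, so "universal" (9 characters) yields 8 2-grams.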

11
11 Inverted lists: convert strings to gram inverted lists
id  string
0   rich
1   stick
2   stich
3   stuck
4   static
Grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc
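A possible way to build the gram inverted lists for these five strings is sketched below. The helper name `build_inverted_lists` is hypothetical; ids are assigned in collection order, so every list comes out sorted in increasing order.

```python
from collections import defaultdict


def build_inverted_lists(strings, q=2):
    """Map each q-gram to the sorted list of ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set, so a string is listed once per gram even if the gram repeats.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_lists(strings)
print(sorted(index))  # ['at', 'ch', 'ck', 'ic', 'ri', 'st', 'ta', 'ti', 'tu', 'uc']
print(index["st"])    # [1, 2, 3, 4]
```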

12
12 Main example
Data:
id  string
0   rich
1   stick
2   stich
3   stuck
4   static
Gram inverted lists: ck: 1,3; ic: 0,1,2,4; st: 1,2,3,4; ta: 4; ti: 1,2,4; …
Query: stick, with grams (st,ti,ic,ck) and ed(s,q) ≤ 1
Merge the lists of st, ti, ic and ck and keep the ids with count >= 2. Performance bottleneck!
Candidate string ids: {1,2,3,4}
Double check for the real edit distance. Final answers: {1,2,3}
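The whole example can be replayed end to end with a short sketch. It assumes the usual count bound T = (number of query grams) - k * q, which gives the slide's count >= 2 for k = 1 and q = 2, and it only applies when T >= 1. The function names are illustrative, and the merge step here is a plain counter rather than any of the merge algorithms discussed in the talk.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def search(query, strings, k, q=2):
    """Return (candidate ids after the count filter, final answer ids)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    grams = qgrams(query, q)
    T = len(grams) - k * q  # assumed count threshold; must be >= 1 to prune
    counts = Counter(sid for gram in grams for sid in index.get(gram, []))
    candidates = sorted(sid for sid, c in counts.items() if c >= T)
    # Verification step: compute the real edit distance for each candidate.
    answers = [sid for sid in candidates if edit_distance(query, strings[sid]) <= k]
    return candidates, answers


strings = ["rich", "stick", "stich", "stuck", "static"]
print(search("stick", strings, k=1))  # ([1, 2, 3, 4], [1, 2, 3])
```

String 4 (static) shares three grams with the query, so it survives the count filter and is only rejected by the edit-distance check.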

13
13 Sub-problem definition: given multiple inverted lists with integer values in increasing order and a threshold T, find all values whose number of occurrences is ≥ T.

14
14 Example Count threshold: 4 Result:

15
15 Outline: Problem motivation; Preliminary; Merge algorithms (two previous algorithms, our three proposed algorithms); Filtering technique; Conclusion.

16
16 Five merge algorithms. Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]. New: ScanCount, MergeSkip, DivideSkip.

17
17 Two previous algorithms (1): heap-based algorithm. Push the list elements to a min-heap and count the # of occurrences of each element with the heap.
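One way to read the heap-based merge is sketched here: the heap holds the current head of every list, and equal heads are popped together and counted. This is an illustrative sketch under that reading, not the code of [Sarawagi 2004]; the name `heap_merge` and the sample lists (the st, ti, ic and ck lists of the stick example) are chosen for the example.

```python
import heapq


def heap_merge(lists, T):
    """Return the values occurring in at least T of the sorted, duplicate-free lists."""
    # Heap entries: (value, list index, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every list head equal to the current minimum and advance that list.
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(value)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]
print(heap_merge(lists, 2))  # [1, 2, 3, 4]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.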

18
18 Example of HeapMerger [Sarawagi et al 2004] minHeap Count threshold

19
19 Five merge algorithms. Previous: HeapMerger [Sarawagi 2004], MergeOpt [Sarawagi 2004]. New: ScanCount, MergeSkip, DivideSkip.

20
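20 Two previous algorithms (2): MergeOpt algorithm. Long lists: the T-1 longest lists. Short lists: the remaining lists. Candidates found in the short lists are looked up in the long lists by binary search.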
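A rough sketch of the long/short split follows. A value that occurs T times must occur at least once in the short lists, because only T-1 lists are set aside as long. The sketch uses the standard library's `heapq.merge` for the heap-based merge of the short lists and omits the early-termination checks a tuned implementation would add; the name `merge_opt` is chosen for illustration.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def merge_opt(lists, T):
    """Return the values occurring in at least T of the sorted, duplicate-free lists."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]
    result = []
    # heapq.merge streams the short lists in sorted order; groupby counts repeats.
    for value, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)
        for lst in long_lists:
            if count >= T:
                break  # already enough occurrences, skip the remaining lookups
            i = bisect_left(lst, value)  # binary search in a long list
            if i < len(lst) and lst[i] == value:
                count += 1
        if count >= T:
            result.append(value)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]
print(merge_opt(lists, 4))  # [1]
```

With T = 4 the three longest lists are never scanned; only the two ids of the short list [1, 3] are probed in them.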

21
21 Example of MergeOpt [Sarawagi et al 2004] Count threshold 4 Long Lists: 3 Short Lists: 2 Min-heap

22
22 Can we run faster?

23
23 Five merge algorithms. Previous: HeapMerger, MergeOpt. New: ScanCount, MergeSkip, DivideSkip.

24
24 Our new algorithms (1): ScanCount algorithm. Use an array to record the # of occurrences of each element.
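The array-based counting can be sketched in a few lines. It assumes the string ids are integers in the range 0 to num_ids - 1; the name `scan_count` and the parameter `num_ids` are introduced here for illustration.

```python
def scan_count(lists, T, num_ids):
    """Return the ids occurring in at least T of the duplicate-free lists."""
    counts = [0] * num_ids  # one counter per string id
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it first reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]
print(scan_count(lists, 2, num_ids=5))  # [1, 2, 3, 4]
```

No heap is needed: each list is scanned sequentially, at the price of an array as large as the id space.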

25
25 ScanCount example. Count threshold: 4. Result: 13

26
26 Five merge algorithms. Previous: HeapMerger, MergeOpt. New: ScanCount, MergeSkip, DivideSkip.

27
27 Our new algorithms (2): MergeSkip algorithm. Min-heap: pop T-1 elements, then jump in those T-1 lists.
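The pop-and-jump idea can be sketched as follows, as one reading of the slide rather than the authors' code. If the smallest head occurs n < T times, the sketch pops until T-1 lists are out of the heap; any value below the new heap top can then occur in at most T-1 lists, so those lists jump (by binary search) straight to the first element >= that top. The name `merge_skip` is chosen for illustration.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return the values occurring in at least T of the sorted, duplicate-free lists."""
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    pos = [0] * len(lists)  # current position in each list
    result = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(value)
            for i in popped:  # ordinary advance by one element
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
            continue
        # Too few occurrences: pop more heads until T-1 lists are out of the heap.
        for _ in range(T - 1 - len(popped)):
            if not heap:
                break
            popped.append(heapq.heappop(heap)[1])
        if not heap:
            break  # fewer than T lists are still active, nothing else can qualify
        target = heap[0][0]
        for i in popped:  # jump: skip everything smaller than the new heap top
            pos[i] = bisect_left(lists[i], target, pos[i])
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]
print(merge_skip(lists, 4))  # [1]
```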

28
28 Example of MergeSkip Count threshold 4 minHeap

29
29 Example of MergeSkip Count threshold 4 minHeap

30
30 Example of MergeSkip Count threshold minHeap Pop 1, 5,10

31
31 Example of MergeSkip Count threshold minHeap Pop 1, 5,10 Jump 13

32
32 Example of HeapMerger Count threshold 4 13 minHeap Result:

33
33 Five merge algorithms. Previous: HeapMerger, MergeOpt. New: ScanCount, MergeSkip, DivideSkip.

34
34 Our new algorithms (3): DivideSkip algorithm. Long lists: dynamic size, searched by binary search. Short lists: merged with MergeSkip.
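A hedged sketch of the combination: the L longest lists are set aside, the short lists are merged MergeSkip-style with the lowered threshold T - L, and each surviving value is probed in the long lists by binary search. The number of long lists uses the formula T / (μ log M + 1) from the talk; taking M as the length of the longest list, the base-2 logarithm, and the default value of `mu` are assumptions made for this example, as are all function names.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip-style merge yielding (value, count) for values with count >= T."""
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    pos = [0] * len(lists)
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            yield value, len(popped)
            target = value + 1  # plain advance past the reported value
        else:
            for _ in range(T - 1 - len(popped)):
                if not heap:
                    break
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                return
            target = heap[0][0]  # jump target: the new heap top
        for i in popped:
            pos[i] = bisect_left(lists[i], target, pos[i])
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))


def divide_skip(lists, T, mu=0.01):
    """Return the values occurring in at least T lists (mu is a tunable, assumed default)."""
    if not lists:
        return []
    by_len = sorted(lists, key=len, reverse=True)
    M = max(len(by_len[0]), 1)  # assumed meaning of M: length of the longest list
    L = min(int(T / (mu * math.log2(M) + 1)), T - 1, len(lists))
    long_lists, short_lists = by_len[:L], by_len[L:]
    result = []
    for value, count in merge_skip_counts(short_lists, T - L):
        for lst in long_lists:
            i = bisect_left(lst, value)
            if i < len(lst) and lst[i] == value:
                count += 1
        if count >= T:
            result.append(value)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]
print(divide_skip(lists, 4, mu=0.5))  # [1]
```

In the usage line, mu = 0.5 and M = 4 give L = 2, so two lists are probed by binary search and the other two are merged with threshold 2.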

35
35 Size of long lists. Long lists; short lists; binary search; MergeOpt. Cost: how many lists are treated as long lists?

36
36 Size of long lists. Long lists; short lists; binary search; MergeSkip. Cost: how many lists are treated as long lists?

37
37 Decide the L value. A good balance in the tradeoff: # of long lists L = T / (μ log M + 1)

38
38 Empirical verification. Our formula for L achieves the best result compared with the other options.

39
39 Experimental data sets. Three real data sets with various string lengths and data sizes: DBLP data, IMDB data, Google Web corpus.

40
40 Performance (DBLP data) Running time per query with various algorithms DivideSkip is the best one

41
41 # of elements read (DBLP data). DivideSkip skips reading the most elements. DivideSkip is the best one.

42
42 Outline: Problem motivation; Preliminary; Merge algorithms; Filtering technique (length and positional filters [Gravano et al. VLDB 2001], filter tree); Conclusion and future work.

43
43 Length filtering. Condition: ed(s,t) ≤ 2. Example: s has length 19 and t has length 10, so their lengths differ by more than 2 and the pair is pruned by length only!
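The length filter rests on the fact that each edit operation changes the length by at most one, so ed(s,t) ≤ k requires the lengths to differ by at most k. A minimal sketch, with a hypothetical function name and placeholder strings standing in for the slide's lost examples:

```python
def passes_length_filter(s, t, k):
    """ed(s, t) <= k is only possible if the lengths differ by at most k."""
    return abs(len(s) - len(t)) <= k


s = "a" * 19  # placeholder string of length 19
t = "b" * 10  # placeholder string of length 10
print(passes_length_filter(s, t, 2))  # False
```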

44
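44 Positional filtering. Positional gram: a gram together with its position in the string. For example, string abcd: {(ab,1),(bc,2),(cd,3)}. Condition: ed(s,t) ≤ 2. The gram (ab,1) in s cannot be matched with the gram (ab,12) in t, because their positions differ by more than 2.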
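Positional grams and the matching test can be sketched as follows. Positions are 1-based as on the slide; the function names are chosen for illustration.

```python
def positional_qgrams(s, q=2):
    """Return (gram, position) pairs with 1-based positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def grams_can_match(g1, g2, k):
    """Two positional grams can correspond only if equal and at most k positions apart."""
    return g1[0] == g2[0] and abs(g1[1] - g2[1]) <= k


print(positional_qgrams("abcd"))                      # [('ab', 1), ('bc', 2), ('cd', 3)]
print(grams_can_match(("ab", 1), ("ab", 12), k=2))    # False
```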

45
45 Filter tree. Levels below the root: length level (2, 3, …, n), gram level (aa, ab, …, zy, zz), position level (1, 2, …, m), then the inverted lists at the leaves.
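One simple way to picture such a tree is a nested dictionary keyed by length, then gram, then position, with an inverted list at each leaf. The sketch below is an assumption about how the levels could be used at query time, not the authors' data structure; `build_filter_tree` and `lists_for_query` are hypothetical names.

```python
from collections import defaultdict


def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> sorted list of string ids."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            tree[len(s)][s[i:i + q]][i + 1].append(sid)
    return tree


def lists_for_query(tree, query, k, q=2):
    """Collect the inverted lists that pass the length and positional filters."""
    lists = []
    for length in range(len(query) - k, len(query) + k + 1):  # length level
        if length not in tree:
            continue
        for i in range(len(query) - q + 1):
            by_pos = tree[length].get(query[i:i + q], {})      # gram level
            for p in range(i + 1 - k, i + 1 + k + 1):          # position level
                if p in by_pos:
                    lists.append(by_pos[p])
    return lists


strings = ["rich", "stick", "stich", "stuck", "static"]
tree = build_filter_tree(strings)
print(lists_for_query(tree, "stick", k=1))
# [[0], [1, 2, 3], [1, 2], [1, 2], [1, 3], [4]]
```

The lists returned are shorter than the unfiltered gram lists, but there can be more of them, which is one reason extra filters do not always pay off.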

46
46 Surprising experimental results (DBLP). Filters compared: no filter, length, length + position. Algorithms compared: Heap, MergeOpt, ScanCount, MergeSkip, DivideSkip. Use filters wisely: more filters may be bad!

47
47 Conclusion. Three new merge algorithms: we run faster. Surprising experimental results: use filters wisely, more filters may be bad!

48
48 Thank you!

49
49 Backup: related work. Approximate string matching [Navarro 2001]. Variable-length grams [Li et al 2007]. Fuzzy lookup in

50
50 Reference
1. [Arasu 2006] A. Arasu, V. Ganti and R. Kaushik. Efficient Exact Set-similarity Joins. In VLDB 2006.
2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti and R. Motwani. Robust and Efficient Fuzzy Match for Online Data Cleaning. In SIGMOD 2003.
3. [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava. Approximate string joins in a database almost for free. In VLDB 2001.

51
51 Reference
4. [Li 2007] C. Li, B. Wang and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB 2007.
5. [Navarro 2001] G. Navarro. A guided tour to approximate string matching. In ACM Computing Surveys, 2001.
6. [Sarawagi 2004] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In ACM SIGMOD 2004.
