Jiaheng Lu, University of California, Irvine


1 Efficient Merging and Filtering Algorithms for Approximate String Searches
Jiaheng Lu, University of California, Irvine Joint work with Chen Li, Yiming Lu

2 Example: a movie database
Find movies starring Schwarrzenger.
Star             Title            Year   Genre
Keanu Reeves     The Matrix       1999   Sci-Fi
Samuel Jackson   Iron man         2008
Schwarzenegger   The Terminator   1984
                 The man          2006   Crime

3 In general: Gap between Queries and Data
Errors in the query:
The user doesn’t remember a string exactly
The user unintentionally types a wrong string
Query: Schwarrzenger
Data: Schwarzenegger

4 Data may not be clean
Errors in the database: data is often not clean by itself, which is especially true in data integration and cleansing.
Relation R, Star: Keanu Reeves, Samuel L. Jackson, Schwarzenegger
Relation S, Star: Keanu Reeves, Samuel Jackson, Schwarzenegger

5 Query may include error

6 Problem definition: approximate string searches
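Collection of strings s (Star): Keanu Reeves, Samuel Jackson, Schwarzenegger
Query q (Search): Samuel Jackson
Output: strings s that satisfy Sim(q,s) ≤ δ, where Sim is a distance-style measure such as edit distance (smaller means more similar)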

7 Example Similarity Function: Edit Distance
A widely used metric to define string similarity
ed(s1,s2) = minimum number of operations (insertion, deletion, substitution) to change s1 to s2
Example: s1 = Tom Hanks, s2 = Ton Hank, ed(s1,s2) = 2
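The definition above can be checked with a short dynamic-programming sketch. This is the standard textbook recurrence, not code from the talk, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] is the distance between the previous prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, as on the slide
```

The two edits in the slide example are the substitution m to n and the deletion of the final s.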

8 Example: approximate string searches
Collection of strings s (Star): Tom Hank, Thomas Hanks, Tom Hanks, Tom J. Hanks
Query q (Search): Ton Hank
Output: strings s that satisfy ed(q,s) ≤ 2

9 Outline
Problem motivation
Preliminary: Grams, Inverted lists
Merge algorithms
Filtering technique
Conclusion

10 String → Grams
q-grams: all overlapping substrings of length q
For example, the 2-grams of "universal": (un),(ni),(iv),(ve),(er),(rs),(sa),(al)
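A minimal sketch of gram extraction, assuming plain overlapping substrings with no padding characters (the slide shows none). The name `qgrams` is illustrative.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping substrings of length q, ordered by position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 grams, so strings shorter than q produce none.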

11 Inverted lists
Convert strings to gram inverted lists
id → string: 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc
Each gram points to the list of ids of the strings that contain it
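The conversion can be sketched as follows, assuming string ids are simply positions in the input list (0 rich through 4 static, which agrees with the lists on the next slide). `build_inverted_lists` is a name made up for this sketch.

```python
from collections import defaultdict


def build_inverted_lists(strings, q=2):
    """Map each q-gram to the sorted list of ids of the strings containing it."""
    lists = defaultdict(list)
    for string_id, s in enumerate(strings):
        # A set keeps each id at most once per list, even if a gram repeats.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            lists[gram].append(string_id)
    return dict(lists)


index = build_inverted_lists(["rich", "stick", "stich", "stuck", "static"])
print(index["st"], index["ic"], index["ck"])
# [1, 2, 3, 4] [0, 1, 2, 4] [1, 3]
```

Because ids are visited in increasing order, every list comes out sorted, which the merge algorithms later rely on.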

12 Main Example
Data (id → string): 0 rich, 1 stick, 2 stich, 3 stuck, 4 static
Query: stick, with ed(s,q) ≤ 1; query grams: (st,ti,ic,ck)
Inverted lists of the query grams: st: 1,2,3,4; ti: 1,2,4; ic: 0,1,2,4; ck: 1,3
Merge: keep ids with count ≥ 2. Candidate string ids: {1,2,3,4}. Performance bottleneck!
Double check for the real edit distance. Final answers: {1,2,3}
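The whole example can be replayed with a small sketch. One assumption is not stated on the slide: the count threshold is computed as (number of query grams) - k·q, the usual count-filter bound, which gives 4 - 1·2 = 2 here. All function names are illustrative.

```python
from collections import Counter


def grams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def search(strings, query, k, q=2):
    """Return (candidate ids after merging, final ids after verification)."""
    index = {}
    for string_id, s in enumerate(strings):
        for gram in set(grams(s, q)):
            index.setdefault(gram, []).append(string_id)

    query_grams = grams(query, q)
    # Assumed count bound (not on the slide): one edit destroys at most q grams.
    threshold = len(query_grams) - k * q
    if threshold <= 0:
        candidates = list(range(len(strings)))  # the count filter cannot prune
    else:
        counts = Counter()
        for gram in query_grams:
            counts.update(index.get(gram, []))  # merge step, done naively here
        candidates = sorted(i for i, c in counts.items() if c >= threshold)

    # Double check the real edit distance for every candidate.
    answers = [i for i in candidates if edit_distance(strings[i], query) <= k]
    return candidates, answers


data = ["rich", "stick", "stich", "stuck", "static"]
print(search(data, "stick", k=1))  # ([1, 2, 3, 4], [1, 2, 3])
```

The naive counting loop stands in for the merge step that the rest of the talk speeds up.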

13 Sub-problem definition:
Given multiple inverted lists with integer values in increasing order and a threshold T, find all values whose number of occurrences across the lists is ≥ T.

14 Example
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
Count threshold: 4
Result: 13

15 Outline
Problem motivation
Preliminary
Merge algorithms: two previous algorithms, our proposed three algorithms
Filtering technique
Conclusion

16 Five Merge Algorithms
Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]
New: ScanCount, MergeSkip, DivideSkip

17 Two previous algorithms (1)
Heap-based algorithm: push the current element of every list to a min-heap, and count the number of occurrences of each element as it is popped from the heap.

18 Example of HeapMerger [Sarawagi et al 2004]
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
minHeap: 1, 5, 10, 13, 15
Count threshold ≥ 4
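A sketch of the heap-based merge, written from the slide's description rather than from the cited paper's code. It assumes each list is strictly increasing, and the five lists are the grouping read off the slide (the heap holds their first elements 1, 5, 10, 13, 15).

```python
import heapq


def heap_merge(lists, threshold):
    """Return the values that occur in at least `threshold` of the sorted lists."""
    # Heap entries: (value, list index, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every copy of the current minimum and advance those lists.
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= threshold:
            result.append(value)
    return result


LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(LISTS, 4))  # [13]
```

Every element of every list passes through the heap, which is the cost the later algorithms try to avoid.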

19 Five Merge Algorithms
Previous: HeapMerger, MergeOpt [Sarawagi 2004]
New: ScanCount, MergeSkip, DivideSkip

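20 Two previous algorithms (2)
MergeOpt algorithm: treat the T-1 longest lists as long lists and the rest as short lists. Merge the short lists, then look up each candidate in the long lists by binary search.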

21 Example of MergeOpt [Sarawagi et al 2004]
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
Long lists: 3; short lists: 2 (merged with a min-heap)
Count threshold ≥ 4
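The long/short split can be sketched like this. It is a simplified reading of MergeOpt, not the cited paper's implementation: the short lists are merged with `heapq.merge` (itself heap-based), and early-termination tricks are left out.

```python
import heapq
from bisect import bisect_left


def contains(sorted_list, value):
    """Binary search for value in a sorted list."""
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value


def merge_opt(lists, threshold):
    """Values occurring in at least `threshold` lists, using T-1 long lists."""
    n_long = max(0, min(threshold - 1, len(lists)))
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:n_long], by_length[n_long:]

    # Count occurrences in the short lists only.
    counts = {}
    for value in heapq.merge(*short_lists):
        counts[value] = counts.get(value, 0) + 1

    result = []
    for value, count in counts.items():  # keys were inserted in increasing order
        # A qualifying value must appear at least T - (T-1) = 1 time in a short list.
        if count < threshold - n_long:
            continue
        count += sum(contains(lst, value) for lst in long_lists)
        if count >= threshold:
            result.append(value)
    return result


LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(LISTS, 4))  # [13]
```

On the slide's data the short lists contribute only 13 and 15, so just two values are searched in the three long lists.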

22 Can we run faster?

23 Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip

24 Our new algorithms (1)
ScanCount algorithm: use an array to record the number of occurrences of each element

25 ScanCount Example
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
Count threshold ≥ 4
Count array (values shown on the slide): 1 2 4
Result: 13
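ScanCount is short enough to sketch directly. The assumptions here are that ids are small non-negative integers, that each id appears at most once per list, and that the caller supplies the array size `num_ids` (a parameter name made up for this sketch).

```python
def scan_count(lists, threshold, num_ids):
    """Scan every list once, counting occurrences of each id in an array."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for string_id in lst:
            counts[string_id] += 1
            # Report an id exactly once, at the moment it reaches the threshold.
            if counts[string_id] == threshold:
                result.append(string_id)
    return sorted(result)


LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(LISTS, 4, num_ids=16))  # [13]
```

There is no heap and no comparison between lists, only one array increment per element read.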

26 Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip

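27 Our new algorithms (2)
MergeSkip algorithm: keep the current element of every list in a min-heap. When the top element occurs fewer than T times, pop T-1 elements from the heap and jump in each popped list to its first element that is ≥ the new top of the heap.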

28 Example of MergeSkip
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
Count threshold ≥ 4

29 Example of MergeSkip
minHeap: 1, 5, 10, 13, 15
Count threshold ≥ 4

30 Example of MergeSkip
Pop 1, 5, 10
minHeap: 13, 15
Count threshold ≥ 4

31 Example of MergeSkip
Pop 1, 5, 10
Jump ≥ 13: each popped list moves to its first element ≥ 13
Count threshold ≥ 4

32 Example of MergeSkip
minHeap: 13, 13, 13, 13, 15
Count threshold ≥ 4
Result: 13
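The pop-and-jump steps of the example can be sketched as below. This is reconstructed from the slides' walkthrough, so details such as the loop's stopping rule are assumptions of this sketch; lists are assumed strictly increasing.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, threshold):
    """Values occurring in at least `threshold` sorted lists, skipping where possible."""
    # Heap entries: (value, list index, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    # With fewer than `threshold` non-exhausted lists, no value can qualify.
    while heap and len(heap) >= threshold:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))
        if len(popped) >= threshold:
            result.append(value)
            for _, i, pos in popped:  # advance each list by one element
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        # Pop more elements until T-1 have been popped in total.
        while len(popped) < threshold - 1:
            popped.append(heapq.heappop(heap))
        # Values below the new top occur only in the popped T-1 lists: skip them.
        target = heap[0][0]
        for _, i, pos in popped:
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(LISTS, 4))  # [13]
```

On this input the first round pops 1, 5, 10 and jumps all three lists to 13, so the elements 3 and 7 are never read.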

33 Five Merge Algorithms
Previous: HeapMerger, MergeOpt
New: ScanCount, MergeSkip, DivideSkip

34 Our new algorithms (3)
DivideSkip algorithm: run MergeSkip on the short lists and binary search on the long lists.
Long lists: dynamic size

35 Size of long lists
How many lists are treated as long lists?
Cost:
Figure labels: MergeOpt, Binary search, Long Lists, Short Lists

36 Size of long lists
How many lists are treated as long lists?
Cost:
Figure labels: MergeSkip, Binary search, Long Lists, Short Lists

37 Decide L value
A good balance in the tradeoff:
# of long lists L = T / (μ log M + 1)
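A sketch of DivideSkip built around the formula above. The slide does not define its symbols, so this sketch assumes M is the length of the longest list, μ is a tunable coefficient (the default 0.01 is only an illustrative value), and the logarithm is base 2. Clamping L to at most T-1 is a safety choice of this sketch.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip(lists, threshold):
    """Return (value, count) for values in at least `threshold` sorted lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    found = []
    while heap and len(heap) >= threshold:
        value, popped = heap[0][0], []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))
        if len(popped) >= threshold:
            found.append((value, len(popped)))
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        while len(popped) < threshold - 1:
            popped.append(heapq.heappop(heap))
        target = heap[0][0]
        for _, i, pos in popped:
            pos = bisect_left(lists[i], target, pos)
            if pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos], i, pos))
    return found


def divide_skip(lists, threshold, mu=0.01):
    """MergeSkip on the short lists, binary search on the L longest lists."""
    longest = max((len(lst) for lst in lists), default=0)
    # L = T / (mu * log M + 1); M and the log base are assumptions (see lead-in).
    estimate = threshold / (mu * math.log2(longest) + 1) if longest else 0
    n_long = max(0, min(int(estimate), threshold - 1, len(lists)))
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:n_long], by_length[n_long:]

    result = []
    # A qualifying value needs at least T - L occurrences in the short lists.
    for value, count in merge_skip(short_lists, threshold - n_long):
        for lst in long_lists:
            i = bisect_left(lst, value)
            if i < len(lst) and lst[i] == value:
                count += 1
        if count >= threshold:
            result.append(value)
    return result


LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(LISTS, 4))  # [13]
```

A larger μ makes binary search look more expensive and so moves lists from the long group to the short group; with `mu=1.0` only one list is treated as long here and the result is unchanged.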

38 Empirical verification
Our formula for L achieves the best result among the options tried.

39 Experimental data sets
Three real data sets with various string lengths and data sizes: DBLP data, IMDB data, Google Web corpus

40 Performance (DBLP data)
Figure: running time per query with various algorithms. DivideSkip is the best one.

41 # of elements read (DBLP data)
DivideSkip is the best one: it skips reading the most elements.

42 Outline
Problem motivation
Preliminary
Merge algorithms
Filtering technique: length and positional filters [Gravano et al. VLDB 2001], filter tree
Conclusion and future work

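43 Length Filtering
If Ed(s,t) ≤ 2, the lengths of s and t differ by at most 2, so a pair whose lengths differ by more can be pruned by length only!
(Slide example: two strings s and t; Length: 10.)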
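The length filter is a one-line check. It relies on the fact that each edit operation changes the length by at most one; the function name is illustrative.

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """False means ed(s, t) > k for certain; True means the pair must still be verified."""
    return abs(len(s) - len(t)) <= k


print(passes_length_filter("stick", "static", 2))        # True: lengths 5 and 6
print(passes_length_filter("rich", "Schwarzenegger", 2))  # False: lengths 4 and 14
```

Passing the filter does not guarantee a match: "stick" and "static" pass with k = 2 although their edit distance is 3.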

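44 Positional Filtering
Positional gram: a gram together with its position in the string.
For example, string abcd: {(ab,1),(bc,2),(cd,3)}
With Ed(s,t) ≤ 2, gram (ab,1) of s cannot be matched with gram (ab,12) of t: the positions differ by more than 2.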
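A sketch of positional grams and the position check, using 1-based positions as in the slide's abcd example. `positional_grams` and `can_match` are names chosen for this sketch.

```python
def positional_grams(s: str, q: int = 2) -> list[tuple[str, int]]:
    """Return (gram, position) pairs with 1-based positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def can_match(gram_s: tuple[str, int], gram_t: tuple[str, int], k: int) -> bool:
    """Two positional grams may correspond only if the gram is the same
    and the positions differ by at most k."""
    return gram_s[0] == gram_t[0] and abs(gram_s[1] - gram_t[1]) <= k


print(positional_grams("abcd"))               # [('ab', 1), ('bc', 2), ('cd', 3)]
print(can_match(("ab", 1), ("ab", 12), k=2))  # False
```

With this check, a shared gram only counts toward the threshold when it sits at a compatible position in both strings.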

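45 Filter tree
Figure: a filter tree. Below the root is a length level (1, 2, 3, …, n), then a gram level (aa, ab, …, zy, zz), then a position level (…, m). Each leaf holds an inverted list of string ids (5, 12, 17, 28, 44).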

46 Surprising experimental results (DBLP)
Algorithm    No filter   Length   Length+Pos
Heap            115.42    11.98         3.64
MergeOpt         14.22     1.40         6.78
ScanCount        30.91     2.68         2.14
MergeSkip        10.12     1.09         2.65
DivideSkip        2.23     0.76         1.96
Wisely use filters, more filters may be bad!

47 Conclusion
Three new merge algorithms: we run faster
Surprising experimental results: wisely use filters, more filters may be bad!

48 Thank you!

49 Backup: related work
Approximate string matching [Navarro 2001]
Fuzzy lookup in
Varied length Grams [Li et al 2007]

50 Reference
1. [Arasu 2006] A. Arasu, V. Ganti and R. Kaushik, “Efficient Exact Set-similarity Joins” in VLDB 2006
2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti and R. Motwani, “Robust and Efficient Fuzzy Match for online Data Cleaning” in SIGMOD 2003
3. [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava, “Approximate string joins in a database almost for free” in VLDB 2001

51 Reference
4. [Li 2007] C. Li, B. Wang and X. Yang, “VGRAM: Improving performance of approximate queries on string collections using variable-length grams” in VLDB 2007
5. [Navarro 2001] G. Navarro, “A guided tour to approximate string matching” in ACM Computing Surveys 2001
6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, “Efficient set joins on similarity predicates” in ACM SIGMOD 2004

