# Jiaheng Lu, University of California, Irvine


Efficient Merging and Filtering Algorithms for Approximate String Searches
Jiaheng Lu, University of California, Irvine
Joint work with Chen Li and Yiming Lu

Example: a movie database
Find movies starring "Schwarrzenger".

| Star | Title | Year | Genre |
|---|---|---|---|
| Keanu Reeves | The Matrix | 1999 | Sci-Fi |
| Samuel Jackson | Iron man | 2008 | |
| Schwarzenegger | The Terminator | 1984 | |
| | The man | 2006 | Crime |

In general: Gap between Queries and Data
Errors in the query:
- The user doesn’t remember a string exactly
- The user unintentionally types a wrong string

Query: Schwarrzenger
Data: Schwarzenegger

Data may not be clean
Errors in the database: data often is not clean by itself, which is especially true in data integration and cleansing.

Relation R (Star): Keanu Reeves, Samuel L. Jackson, Schwarzenegger
Relation S (Star): Keanu Reeves, Samuel Jackson, Schwarzenegger

The query may include errors

Problem definition: approximate string searches
Given a collection of strings s (for example the Star column: Keanu Reeves, Samuel Jackson, Schwarzenegger) and a query string q, search the collection.
Output: strings s that satisfy Sim(q,s) ≤ δ

Example Similarity Function: Edit Distance
A widely used metric to define string similarity.
Ed(s1,s2) = minimum number of operations (insertion, deletion, substitution) needed to change s1 to s2.
Example: s1 = "Tom Hanks", s2 = "Ton Hank", ed(s1,s2) = 2 (substitute m with n, delete the final s).
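The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not code from the talk; the function name `edit_distance` is chosen here for the example.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    # prev[j] = distance between the empty prefix of s1 and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance between s1[:i] and the empty string
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

Only two rows of the table are kept at a time, so memory grows with the length of `s2` rather than with the product of both lengths.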

Example: approximate string searches
Collection of strings s (Star): Tom Hank, Thomas Hanks, Tom Hanks, Tom J. Hanks
Query q: Ton Hank
Output: strings s that satisfy ed(q,s) ≤ 2

Outline
- Problem motivation
- Preliminary
  - Grams
  - Inverted lists
- Merge algorithms
- Filtering technique
- Conclusion

String → Grams
q-grams: all substrings of length q. For example, the 2-grams of "universal" are (un),(ni),(iv),(ve),(er),(rs),(sa),(al).
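Gram extraction is a sliding window over the string. A minimal sketch, with the helper name `qgrams` chosen for illustration and no padding characters assumed:

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """All substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n yields n - q + 1 grams, and a string shorter than q yields none.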

Inverted lists
Convert strings to gram inverted lists.

| id | string |
|---|---|
| 0 | rich |
| 1 | stick |
| 2 | stich |
| 3 | stuck |
| 4 | static |

2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc. Each 2-gram maps to the sorted list of ids of the strings that contain it.
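Building the inverted lists can be sketched as follows, assuming string ids are assigned by position starting at 0 (which is consistent with the list 0,1,2,4 shown for "ic" in the main example). The names `qgrams` and `build_inverted_lists` are illustrative.

```python
from collections import defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_lists(strings, q=2):
    """Map each gram to the increasing list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # set(): an id is recorded at most once per gram
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_lists(strings)
for gram in ("st", "ti", "ic", "ck"):
    print(gram, index[gram])
```

Because ids are visited in increasing order, every list comes out sorted, which the merge algorithms later rely on.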

Main Example
Query: "stick", ed(s,q) ≤ 1. Query grams: (st,ti,ic,ck).

Data (id, string): 0 rich, 1 stick, 2 stich, 3 stuck, 4 static

Inverted lists of the query grams:
st: 1,2,3,4
ti: 1,2,4
ic: 0,1,2,4
ck: 1,3

Merge the lists and keep the ids with count >= 2. This merge step is the performance bottleneck!
Candidate string ids: {1,2,3,4}
Double check the candidates for the real edit distance.
Final answers: {1,2,3}
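The whole pipeline of this example (merge, count filter, verification) can be sketched end to end. The threshold is computed here with the usual count-filter bound T = (number of query grams) - k·q, which is an assumption on my part but reproduces the "count >= 2" of the slide. The merge is a naive counter, and all function names are illustrative.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def search(strings, query, k, q=2):
    """Return (threshold, candidate ids, verified answer ids)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)

    grams = qgrams(query, q)
    threshold = len(grams) - k * q  # assumed count-filter bound
    if threshold <= 0:
        # The count filter cannot prune anything: verify every string.
        candidates = list(range(len(strings)))
    else:
        counts = Counter(sid for g in grams for sid in index.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)

    # Double check the candidates against the real edit distance.
    answers = [sid for sid in candidates
               if edit_distance(strings[sid], query) <= k]
    return threshold, candidates, answers


strings = ["rich", "stick", "stich", "stuck", "static"]
print(search(strings, "stick", k=1))  # (2, [1, 2, 3, 4], [1, 2, 3])
```

String 4 ("static") shares three grams with the query and so survives the count filter, but its edit distance to "stick" is 3, so verification removes it.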

Sub-problem definitions:
Given multiple inverted lists with integer values in increasing order and a threshold T, we find all values whose number of occurrences ≥ T.

Example
Count threshold: 4
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
Result: 13

Outline
- Problem motivation
- Preliminary
- Merge algorithms
  - Two previous algorithms
  - Our proposed three algorithms
- Filtering technique
- Conclusion

Five Merge Algorithms
Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]
New: ScanCount, MergeSkip, DivideSkip

Two previous algorithms (1)
Heap-based algorithm (HeapMerger): push the current element of each list onto a min-heap, and count the number of occurrences of each element as it is popped from the heap.

Example of HeapMerger [Sarawagi et al 2004]
Count threshold ≥ 4
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
minHeap (initial, one head per list): 1, 10, 5, 13, 15
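A heap-based merge along these lines might look as follows. It is a sketch that assumes each list is strictly increasing; the function name `heap_merge` and the five lists (read off the slide example) are used for illustration.

```python
import heapq


def heap_merge(lists, T):
    """Return the values that occur on at least T of the sorted lists."""
    # Heap entries: (value, list index, position within that list)
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every occurrence of the current minimum and advance those lists.
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is what the later algorithms try to avoid.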


Two previous algorithms (2)
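MergeOpt algorithm: treat the T-1 longest lists as long lists and the rest as short lists. Merge the short lists, then look up each candidate in the long lists by binary search.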

Example of MergeOpt [Sarawagi et al 2004]
Count threshold ≥ 4
Long lists: 3, namely (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13)
Short lists: 2, namely (13), (15), merged with a min-heap
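The long/short split can be sketched as below. This is a simplified reading of MergeOpt rather than the published algorithm: it merges the short lists with `heapq.merge` and probes the T-1 longest lists with `bisect`. The names `merge_opt` and `contains` are illustrative.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def contains(sorted_list, x):
    """Binary search for x in a sorted list."""
    i = bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x


def merge_opt(lists, T):
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    result = []
    # A value on >= T lists must occur on at least one short list,
    # because there are only T-1 long lists.
    for value, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)
        count += sum(contains(lst, value) for lst in long_lists)
        if count >= T:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```

In the example only the two one-element lists are merged; the values 13 and 15 are then each probed in the three long lists.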

Can we run faster?


Our new algorithms (1)
ScanCount algorithm: use an array to record the number of occurrences of each element.

ScanCount Example
Count threshold ≥ 4
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
Result: 13
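ScanCount needs no heap at all. The sketch below assumes ids are integers in the range 0 to `num_ids` - 1 and that an id appears at most once per list; `num_ids` is a parameter introduced here for illustration.

```python
def scan_count(lists, T, num_ids):
    """Count occurrences in a flat array; report ids that reach T."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # reported exactly once, when T is reached
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, num_ids=16))  # [13]
```

The array costs memory proportional to the number of strings, but each list element is handled with a single increment.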


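Our new algorithms (2)
MergeSkip algorithm: pop the smallest element from the min-heap. If it occurs fewer than T times, pop further elements until T-1 have been popped, then make each of those T-1 lists jump to its smallest element that is no less than the new top of the heap.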

Example of MergeSkip
Count threshold ≥ 4
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
minHeap: 1, 5, 10, 13, 15
Pop 1, 5, 10 (T-1 = 3 elements). The top of the heap is now 13.
Jump ≥ 13: each of the three popped lists skips ahead to its first element that is at least 13.
minHeap: 13, 13, 13, 13, 15
Result: 13
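The pop-and-jump steps of this example can be sketched as follows. The code assumes strictly increasing lists and uses `bisect_left` for the jump; `merge_skip` is an illustrative name, and the details are my reading of the slides rather than the authors' implementation.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Values on at least T lists, skipping elements that cannot qualify."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []

    def push_from(i, pos):
        if pos < len(lists[i]):
            heapq.heappush(heap, (lists[i][pos], i, pos))

    # With fewer than T lists left, no value can reach the threshold.
    while len(heap) >= T:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(value)
            for _, i, pos in popped:
                push_from(i, pos + 1)
            continue
        # Pop more elements until T-1 have been popped in total.
        while len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        # Anything below the new top occurs on at most T-1 lists: skip it.
        target = heap[0][0]
        for _, i, pos in popped:
            push_from(i, bisect_left(lists[i], target, pos))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

On the example the elements 3 and 7 are never pushed onto the heap: the jump to 13 skips over them.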


Our new algorithms (3)
DivideSkip algorithm: run MergeSkip on the short lists and binary search on the long lists. The number of long lists is chosen dynamically.

Size of long lists
How many lists are treated as long lists?
Cost: binary search on the long lists, plus merging the short lists. Treating more lists as long leaves less to merge but requires more binary searches.

Decide L value
A good balance in the tradeoff:
# of long lists L = T / (μ log M + 1)
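Putting the formula to work, DivideSkip can be sketched as MergeSkip over the short lists with threshold T - L, followed by binary search in the L longest lists. Several things here are assumptions, since the slide does not define them: M is taken to be the length of the longest list, μ is a tunable coefficient (`mu=1.0` is an arbitrary default), the logarithm is base 2, and L is clamped to at most T-1. All names are illustrative.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip variant yielding (value, count) for values on >= T lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while len(heap) >= T:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            out.append((value, len(popped)))
            moves = [(i, pos + 1) for _, i, pos in popped]
        else:
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap))
            target = heap[0][0]
            moves = [(i, bisect_left(lists[i], target, pos))
                     for _, i, pos in popped]
        for i, pos in moves:
            if pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos], i, pos))
    return out


def divide_skip(lists, T, mu=1.0):
    lists = [lst for lst in lists if lst]
    if len(lists) < T:
        return []
    M = max(len(lst) for lst in lists)        # assumed meaning of M
    L = int(T / (mu * math.log2(M) + 1))      # formula from the slide
    L = max(0, min(L, T - 1))                 # keep the short threshold >= 1
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:L], by_length[L:]
    result = []
    for value, count in merge_skip_counts(short_lists, T - L):
        for lst in long_lists:
            i = bisect_left(lst, value)
            count += i < len(lst) and lst[i] == value
        if count >= T:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, 4))  # [13]
```

With these defaults the example gives L = 1, so the longest list is only probed by binary search and the other four are merged with threshold 3.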

Empirical verification
Our formula for L achieves the best result compared with the other options.

Experimental data sets
Three real data sets with various string lengths and data sizes: DBLP data, IMDB data, Google Web corpus

Performance (DBLP data)
Figure: running time per query with various algorithms. DivideSkip is the best one.

# of elements read (DBLP data)
DivideSkip is the best one: it skips reading the most elements.

Outline
- Problem motivation
- Preliminary
- Merge algorithms
- Filtering technique
  - Length, positional filter [Gravano et al. VLDB 2001]
  - Filter tree
- Conclusion and future work

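Length Filtering
Ed(s,t) ≤ 2 implies that the lengths of s and t differ by at most 2, so a pair can be pruned by length only! For a string of length 10, only strings of length 8 to 12 need to be considered.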
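The length filter is a single comparison. A minimal sketch, with `length_filter` as an illustrative name:

```python
def length_filter(strings, query, k):
    """Keep only strings whose length is within k of the query length."""
    return [s for s in strings if abs(len(s) - len(query)) <= k]


strings = ["rich", "stick", "stich", "stuck", "static"]
print(length_filter(strings, "stick", k=0))  # ['stick', 'stich', 'stuck']
```

Each insertion or deletion changes the length by one and a substitution leaves it unchanged, so k edit operations can change the length by at most k.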

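Positional Filtering
A positional gram is a gram together with its position in the string. For example, string abcd: {(ab,1),(bc,2),(cd,3)}.
With Ed(s,t) ≤ 2, the gram (ab,1) in s cannot be matched with the gram (ab,12) in t, because their positions differ by more than 2.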
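Positional grams and the matching rule can be sketched as below, using 1-based positions as on the slide. The names `positional_qgrams` and `grams_can_match` are illustrative.

```python
def positional_qgrams(s, q=2):
    """Grams paired with their 1-based starting position."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def grams_can_match(g1, g2, k):
    """Two positional grams may correspond only if equal and within k positions."""
    (gram1, pos1), (gram2, pos2) = g1, g2
    return gram1 == gram2 and abs(pos1 - pos2) <= k


print(positional_qgrams("abcd"))                    # [('ab', 1), ('bc', 2), ('cd', 3)]
print(grams_can_match(("ab", 1), ("ab", 12), k=2))  # False
```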

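Filter tree, from the root down:
Length level (one child per string length)
Gram level (aa, ab, …, zy, zz)
Position level (one child per gram position)
Inverted list at each leaf (for example 5, 12, 17, 28, 44)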
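One way to picture the filter tree is as nested dictionaries keyed by length, then gram, then position, with an inverted list at each leaf. This is only a sketch of the idea under that assumption, not the structure used in the talk; `build_filter_tree` and `lookup` are hypothetical names.

```python
from collections import defaultdict


def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> increasing list of string ids."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            tree[len(s)][s[i:i + q]][i + 1].append(sid)
    return tree


def lookup(tree, gram, pos, query_len, k):
    """Ids of strings that pass the length and positional filters for one gram."""
    ids = set()
    for length in range(query_len - k, query_len + k + 1):
        by_gram = tree.get(length)
        if by_gram is None or gram not in by_gram:
            continue
        for p in range(pos - k, pos + k + 1):
            ids.update(by_gram[gram].get(p, []))
    return sorted(ids)


strings = ["rich", "stick", "stich", "stuck", "static"]
tree = build_filter_tree(strings)
print(lookup(tree, "ic", pos=3, query_len=5, k=1))  # [0, 1, 2]
```

For the gram "ic" at position 3 of "stick", string 4 ("static") is excluded because its "ic" sits at position 5, outside the allowed range of 2 to 4.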

Surprising experimental results (DBLP)

| Algorithm | No filter | Length | Length+Pos |
|---|---|---|---|
| Heap | 115.42 | 11.98 | 3.64 |
| MergeOpt | 14.22 | 1.40 | 6.78 |
| ScanCount | 30.91 | 2.68 | 2.14 |
| MergeSkip | 10.12 | 1.09 | 2.65 |
| DivideSkip | 2.23 | 0.76 | 1.96 |

Use filters wisely: more filters may be bad!

Conclusion
- Three new merge algorithms: we run faster
- Surprising experimental results: use filters wisely, more filters may be bad!

Thank you!

Backup: related work
- Approximate string matching [Navarro 2001]
- Fuzzy lookup in
- Varied length grams [Li et al 2007]

References
1. [Arasu 2006] A. Arasu, V. Ganti, and R. Kaushik, "Efficient Exact Set-similarity Joins", in VLDB 2006
2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning", in SIGMOD 2003
3. [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, "Approximate string joins in a database almost for free", in VLDB 2001
4. [Li 2007] C. Li, B. Wang, and X. Yang, "VGRAM: Improving performance of approximate queries on string collections using variable-length grams", in VLDB 2007
5. [Navarro 2001] G. Navarro, "A guided tour to approximate string matching", in ACM Computing Surveys 2001
6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, "Efficient set joins on similarity predicates", in ACM SIGMOD 2004
