Jiaheng Lu, University of California, Irvine

Presentation transcript:

Efficient Merging and Filtering Algorithms for Approximate String Searches Jiaheng Lu, University of California, Irvine Joint work with Chen Li, Yiming Lu

Example: a movie database
Find movies starring “Schwarrzenger”.
Star             Title            Year   Genre
Keanu Reeves     The Matrix       1999   Sci-Fi
Samuel Jackson   Iron man         2008
Schwarzenegger   The Terminator   1984
                 The man          2006   Crime

In general: gap between queries and data
Errors in the query:
  The user doesn’t remember a string exactly.
  The user unintentionally types a wrong string.
Query: Schwarrzenger
Data: Schwarzenegger …

Data may not be clean
Errors in the database: data is often not clean by itself, especially in data integration and cleansing.
Relation R, Star: Keanu Reeves, Samuel L. Jackson, Schwarzenegger
Relation S, Star: Keanu Reeves, Samuel Jackson, Schwarzenegger

The query may include errors

Problem definition: approximate string searches
Collection of strings s
Query q
[Diagram: a Search box for query q over a Star column; names shown: Keanu Reeves, Samuel Jackson, Schwarzenegger, …]
Output: strings s that satisfy Sim(q,s) ≤ δ

Example similarity function: edit distance
A widely used metric to define string similarity.
ed(s1,s2) = minimum number of operations (insertion, deletion, substitution) to change s1 into s2
Example: s1 = Tom Hanks, s2 = Ton Hank, ed(s1,s2) = 2
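
The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not the authors' code; the function name `edit_distance` is chosen here for the example.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] is the distance between the already processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # the slide's example pair
```

The table is kept one row at a time, so memory grows with the length of s2 only.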

Example: approximate string searches
Collection of strings s
Query q
[Diagram: a Search box for query q over a Star column; names shown: Tom Hank, Thomas Hanks, Ton Hank, Tom Hanks, Tom J. Hanks, …]
Output: strings s that satisfy ed(q,s) ≤ 2

Outline
Problem motivation
Preliminary: grams, inverted lists
Merge algorithms
Filtering technique
Conclusion

String → Grams
q-grams: for example, the 2-grams of “universal” are (un),(ni),(iv),(ve),(er),(rs),(sa),(al)
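
As a small sketch of the decomposition on this slide, the q-grams of a string are its overlapping substrings of length q. The helper name `qgrams` is an assumption for illustration, and no padding characters are added at the string ends.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
```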

Inverted lists
Convert strings to gram inverted lists.
id   string
0    rich
1    stick
2    stich
3    stuck
4    static
2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc
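
A possible way to build such inverted lists is sketched below. It assumes string ids are assigned from 0 in input order (so "rich" gets id 0, consistent with the list for gram "ic" on the next slide); `build_inverted_index` is a name made up for this example.

```python
from collections import defaultdict


def build_inverted_index(strings, q=2):
    """Map each q-gram to the increasing list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set avoids adding the same id twice when a gram repeats in a string.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)
    return dict(index)


index = build_inverted_index(["rich", "stick", "stich", "stuck", "static"])
print(sorted(index))
print(index["st"], index["ti"], index["ic"], index["ck"])
```

Because ids are visited in increasing order, every list comes out sorted, which the merge algorithms rely on.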

Main example
Query: “stick” with ed(s,q) ≤ 1; grams of the query: (st,ti,ic,ck)
Inverted lists of the query grams:
  st: 1,2,3,4
  ti: 1,2,4
  ic: 0,1,2,4
  ck: 1,3
Merge (performance bottleneck!): keep the string ids with count ≥ 2, giving candidate string ids {1,2,3,4}
Double-check the real edit distance of each candidate: final answers {1,2,3}
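
The whole pipeline of this example can be sketched end to end. Assumptions not stated on the slide: the count threshold is computed as (number of query grams) minus k times q, which gives 4 - 1*2 = 2 here and matches "count ≥ 2"; the merge step is a plain counter rather than any of the algorithms discussed later; all function names are illustrative.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def search(strings, query, k, q=2):
    """Return (candidate ids after the count filter, ids with ed <= k)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)

    grams = qgrams(query, q)
    threshold = len(grams) - k * q  # assumed count-filter bound
    if threshold <= 0:
        # The count filter cannot prune anything: every string is a candidate.
        candidates = list(range(len(strings)))
    else:
        counts = Counter(sid for g in grams for sid in index.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)

    # Double-check the real edit distance of each candidate.
    answers = [sid for sid in candidates if edit_distance(strings[sid], query) <= k]
    return candidates, answers


strings = ["rich", "stick", "stich", "stuck", "static"]
print(search(strings, "stick", k=1))
```

The counting step inside `search` is the part the rest of the talk tries to speed up.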

Sub-problem definition: given multiple inverted lists with integer values in increasing order and a threshold T, find all values whose number of occurrences is ≥ T.

Example
Count threshold: 4
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
Result: 13

Outline
Problem motivation
Preliminary
Merge algorithms: two previous algorithms, our three proposed algorithms
Filtering technique
Conclusion

Five merge algorithms
Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]
New: ScanCount, MergeSkip, DivideSkip

Two previous algorithms (1): heap-based algorithm
Push the head of each list to a min-heap.
Count the number of occurrences of each element using the heap.

Example of HeapMerger [Sarawagi et al 2004]
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
minHeap of list heads: 1, 5, 10, 13, 15
Count threshold ≥ 4
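
A minimal sketch of the heap-based merge follows, assuming each list is strictly increasing. The function name `heap_merge` and the tuple layout in the heap are choices made for this illustration, not taken from the cited paper.

```python
import heapq


def heap_merge(lists, threshold):
    """Return the values that occur in at least `threshold` of the sorted lists."""
    # Each heap entry is (value, list index, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every list head equal to the current minimum and advance that list.
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= threshold:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))
```

Every element of every list passes through the heap once, which is why this baseline is slow on long lists.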

Two previous algorithms (2): MergeOpt algorithm
Long lists: the T-1 longest lists, probed by binary search
Short lists: the remaining lists, merged with a min-heap

Example of MergeOpt [Sarawagi et al 2004]
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
Long lists: 3 (binary search); short lists: 2 (min-heap)
Count threshold ≥ 4
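
The long/short split can be sketched as below. This is a simplified illustration under stated assumptions: the short lists are merged with `heapq.merge` instead of a hand-written heap loop, no early termination is attempted, and `threshold` is at least 1. The names `merge_opt` and `contains` are hypothetical.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def contains(sorted_list, value):
    """Binary search for value in an increasing list."""
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value


def merge_opt(lists, threshold):
    """Values occurring in at least `threshold` lists, using T-1 long lists."""
    by_length = sorted(lists, key=len, reverse=True)
    num_long = min(threshold - 1, len(by_length))
    long_lists, short_lists = by_length[:num_long], by_length[num_long:]

    result = []
    # Any answer must occur in at least one short list, because the long
    # lists alone can contribute at most T-1 occurrences.
    for value, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)  # occurrences among the short lists
        count += sum(1 for lst in long_lists if contains(lst, value))
        if count >= threshold:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))
```

In the slide's example the two short lists yield only 13 and 15 as candidates, so the three long lists are probed just twice each.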

Can we run faster?

Our new algorithms (1): ScanCount algorithm
Use an array to record the number of occurrences of each element.

ScanCount example
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
Count array indexed by string id 1 to 15; the count for id 13 reaches 4.
Count threshold ≥ 4
Result: 13
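
ScanCount can be sketched in a few lines. The sketch assumes the largest string id (`max_id`) is known in advance so the count array can be allocated once; that parameter name is introduced here for illustration.

```python
def scan_count(lists, threshold, max_id):
    """Scan every list once, counting occurrences of each id in an array."""
    counts = [0] * (max_id + 1)
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            # Report an id exactly once, at the moment it reaches the threshold.
            if counts[sid] == threshold:
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, max_id=15))
```

No heap and no comparisons between lists are needed, at the price of an array as large as the id range.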

Our new algorithms (2): MergeSkip algorithm
Pop T-1 elements from the min-heap, then jump in the T-1 popped lists.

Example of MergeSkip (count threshold ≥ 4)
Lists: (1, 3, 5, 10, 13), (10, 13, 15), (5, 7, 13), (13), (15)
Step 1: the minHeap holds the list heads 1, 5, 10, 13, 15.
Step 2: pop T-1 = 3 elements (1, 5, 10); the minHeap keeps 13 and 15.
Step 3: jump in the three popped lists to the first element ≥ 13.
Step 4: 13 is now at the head of four lists, so its count reaches 4. Result: 13
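
The pop-and-jump idea of MergeSkip can be sketched as follows. This is one reading of the slides, not the authors' implementation: when the heap top occurs fewer than T times, additional heads are popped until T-1 have been removed, and each popped list then jumps by binary search to its first element that is ≥ the new heap top. The name `merge_skip` is chosen for the example.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, threshold):
    """Values occurring in at least `threshold` of the increasing lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))

        if len(popped) >= threshold:
            result.append(value)
            for _, i, pos in popped:  # advance each matching list by one
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue

        # Too few occurrences: pop more heads until T-1 lists are popped.
        while heap and len(popped) < threshold - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break  # fewer than T lists are still active, nothing else can qualify
        target = heap[0][0]
        # Jump: skip every element smaller than the new heap top.
        for _, i, pos in popped:
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))
```

On the example lists the first iteration pops 1, 5 and 10 and jumps all three lists straight to 13, so 3, 7 and the second 10 are never pushed to the heap.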

Our new algorithms (3): DivideSkip algorithm
Long lists (dynamic size): binary search
Short lists: MergeSkip

Size of long lists
How many lists are treated as long lists?
[Diagram labels: Cost, MergeOpt, Binary search, Long Lists, Short Lists]

Size of long lists
How many lists are treated as long lists?
[Diagram labels: Cost, MergeSkip, Binary search, Long Lists, Short Lists]

Deciding the value of L
A good balance in the tradeoff: number of long lists L = T / (μ log M + 1)
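
Putting the pieces together, DivideSkip with the formula above might look like the sketch below. Several points are assumptions, since the slides do not define them: M is taken as the length of the longest list, the logarithm is base 2, the default `mu=0.01` is an arbitrary placeholder, and L is capped at T-1 so that the short lists keep a threshold of at least 1. Function names are illustrative.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip_counts(lists, threshold):
    """MergeSkip variant returning (value, count) for counts >= threshold."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    found = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))
        if len(popped) >= threshold:
            found.append((value, len(popped)))
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        while heap and len(popped) < threshold - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break
        target = heap[0][0]
        for _, i, pos in popped:
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return found


def divide_skip(lists, threshold, mu=0.01):
    by_length = sorted(lists, key=len, reverse=True)
    if not by_length or not by_length[0]:
        return []
    longest = len(by_length[0])  # assumed meaning of M
    num_long = int(threshold / (mu * math.log2(longest) + 1))
    num_long = max(0, min(num_long, threshold - 1, len(by_length)))
    long_lists, short_lists = by_length[:num_long], by_length[num_long:]

    result = []
    # MergeSkip on the short lists with the reduced threshold T - L.
    for value, count in merge_skip_counts(short_lists, threshold - num_long):
        for lst in long_lists:  # binary search in each long list
            i = bisect_left(lst, value)
            if i < len(lst) and lst[i] == value:
                count += 1
        if count >= threshold:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, 4))
```

A larger `mu` makes binary search look more expensive and therefore moves lists from the long group to the short group.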

Empirical verification
Our formula for “L” achieves the best results among the options tested.

Experimental data sets
Three real data sets with various string lengths and data sizes: DBLP data, IMDB data, Google Web corpus

Performance (DBLP data)
Running time per query with various algorithms: DivideSkip is the best one.

Number of elements read (DBLP data)
DivideSkip is the best one: it skips reading the most elements.

Outline
Problem motivation
Preliminary
Merge algorithms
Filtering technique: length and positional filters [Gravano et al. VLDB 2001], filter tree
Conclusion and future work

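Length filtering
Condition: Ed(s,t) ≤ 2
[Figure: strings s and t, one labeled with length 10]
If the lengths of s and t differ by more than 2, the pair is pruned by length only!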
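
The length filter follows from the fact that each edit operation changes the length of a string by at most one. A minimal sketch, with an illustrative function name:

```python
def length_filter(strings, query, k):
    """Keep only strings whose length is within k of the query length."""
    return [s for s in strings if abs(len(s) - len(query)) <= k]


strings = ["rich", "stick", "stich", "stuck", "static"]
print(length_filter(strings, "stick", k=0))
```

Strings removed here never need their inverted lists merged or their edit distance computed.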

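Positional filtering
Positional gram: a gram together with its position. For example, string abcd: {(ab,1),(bc,2),(cd,3)}
Condition: Ed(s,t) ≤ 2
Gram (ab,1) in s and gram (ab,12) in t cannot be matched, because their positions differ by more than 2.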
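
Positional grams and the position check can be sketched as below, using 1-based positions as in the slide's example. The helper names `positional_qgrams` and `can_match` are assumptions made for this illustration.

```python
def positional_qgrams(s, q=2):
    """Return (gram, position) pairs with 1-based positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def can_match(pgram_s, pgram_t, k):
    """Two positional grams can correspond only if the grams are equal
    and their positions differ by at most k."""
    (gram_s, pos_s), (gram_t, pos_t) = pgram_s, pgram_t
    return gram_s == gram_t and abs(pos_s - pos_t) <= k


print(positional_qgrams("abcd"))
print(can_match(("ab", 1), ("ab", 12), k=2))
```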

Filter tree
[Figure: a tree with a root, a length level, a gram level (aa, ab, …, zy, zz) and a position level; each leaf points to an inverted list, e.g. 5, 12, 17, 28, 44]
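
One hypothetical way to represent such a tree is a nested dictionary keyed by length, then gram, then position, with an inverted list at each leaf. The layout, the 1-based positions and the function names below are assumptions for illustration; the slide only shows the three levels.

```python
from collections import defaultdict


def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> increasing list of string ids."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            tree[len(s)][s[i:i + q]][i + 1].append(sid)
    return tree


def lookup(tree, gram, query_len, pos, k):
    """Ids of strings that pass both the length and the positional filter."""
    ids = set()
    for length in range(query_len - k, query_len + k + 1):
        if length not in tree:  # membership test does not create new branches
            continue
        by_pos = tree[length].get(gram)
        if not by_pos:
            continue
        for p in range(pos - k, pos + k + 1):
            ids.update(by_pos.get(p, ()))
    return sorted(ids)


tree = build_filter_tree(["rich", "stick", "stich", "stuck", "static"])
print(lookup(tree, "ic", query_len=5, pos=3, k=1))
```

Each lookup returns shorter lists than the plain inverted index would, but a query may have to visit several leaves per gram.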

Surprising experimental results (DBLP)
Algorithm    No filter   Length   Length+Pos
Heap         115.42      11.98    3.64
MergeOpt     14.22       1.40     6.78
ScanCount    30.91       2.68     2.14
MergeSkip    10.12       1.09     2.65
DivideSkip   2.23        0.76     1.96
Use filters wisely: more filters may be bad!

Conclusion
Three new merge algorithms: we run faster.
Surprising experimental results: use filters wisely, more filters may be bad!

Thank you!

Backup: related work
Approximate string matching [Navarro 2001]
Fuzzy lookup
Varied-length grams [Li et al 2007]

Reference
1. [Arasu 2006] A. Arasu, V. Ganti and R. Kaushik, “Efficient Exact Set-similarity Joins”, in VLDB 2006
2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti and R. Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning”, in SIGMOD 2003
3. [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava, “Approximate string joins in a database (almost) for free”, in VLDB 2001

Reference
4. [Li 2007] C. Li, B. Wang and X. Yang, “VGRAM: Improving performance of approximate queries on string collections using variable-length grams”, in VLDB 2007
5. [Navarro 2001] G. Navarro, “A guided tour to approximate string matching”, in ACM Computing Surveys 2001
6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, “Efficient set joins on similarity predicates”, in ACM SIGMOD 2004