1 Speaker: Alexander Behm
Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
Alexander Behm (1), Shengyue Ji (1), Chen Li (1), Jiaheng Lu (2)
(1) University of California, Irvine
(2) Renmin University of China

2 Speaker: Alexander Behm Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Motivation: Data Cleaning Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008 Should clearly be “Niels Bohr”

3 Motivation: Record Linkage

Table 1:
Name | Hobbies | Address
Brad Pitt | … | …
Forest Whittacker | … | …
George Bush | … | …
Angelina Jolie | … | …
Arnold Schwarzenegger | … | …

Table 2:
Phone | Age | Name
… | … | Brad Pitt
… | … | Arnold Schwarzeneger
… | … | George Bush
… | … | Angelina Jolie
… | … | Forrest Whittaker

No exact match!

4 Motivation: Query Relaxation
Actual queries gathered by Google
Source: http://www.google.com/jobs/britney.html

5 What is Approximate String Search?
String collection: Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzenegger, …
Query against collection: Find entries similar to “Arnold Schwarseneger”
What do we mean by similar to?
- Edit Distance
- Jaccard Similarity
- Cosine Similarity
- Dice
- Etc.
How can we support these types of queries efficiently?
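To make two of the listed measures concrete, here is a minimal Python sketch of edit distance (the classic dynamic program) and of Jaccard similarity over gram sets. The function names and the default q=2 are illustrative assumptions, not taken from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str, q: int = 2) -> float:
    """Jaccard similarity of the q-gram sets of a and b."""
    ga = {a[i:i + q] for i in range(len(a) - q + 1)}
    gb = {b[i:i + q] for i in range(len(b) - q + 1)}
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0
```

With the names from the record-linkage slide, `edit_distance("Forest Whittacker", "Forrest Whittaker")` evaluates to 2: one inserted "r" and one deleted "c".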

6 Approximate Query Answering
Sliding window over “irvine” yields the 2-grams {ir, rv, vi, in, ne}
Intuition: Similar strings share a certain number of grams
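The sliding window can be sketched in a line of Python. The function name `grams` is an illustrative assumption.

```python
def grams(s: str, q: int = 2) -> list[str]:
    """Slide a window of width q over s, one position at a time."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(grams("irvine"))                               # ['ir', 'rv', 'vi', 'in', 'ne']
print(set(grams("irvine")) & set(grams("irving")))   # the four shared 2-grams
```

"irvine" and "irving" are one substitution apart and still share four of their five 2-grams, which is the intuition the slide states.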

7 Approximate Query Example
Query: “irvine”, Edit Distance 1
2-grams of the query: {ir, rv, vi, in, ne}
2-grams in the index: tf, vi, ir, ef, rv, ne, un, in, …
Inverted lists (stringIDs) in the index: {1,3,4,5,7,9}, {5,9}, {1,5}, {1,2,3,9}, {3,9}, {7,9}, {5,6,9}, {1,2,4,5,6}
Lookup grams: the lists of the five query grams are {1,3,4,5,7,9}, {1,5}, {1,2,3,9}, {7,9}, {5,6,9}
Count >= 3 -> Candidates = {1, 5, 9}
May have false positives
(The pairing of individual grams with lists shown in the figure is not preserved in this transcript.)
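The lookup-and-count step can be sketched end to end in Python. This is a minimal sketch under stated assumptions: the five example strings, the function names, and the use of gram sets (with T computed from the number of distinct query grams) are illustrative choices, not the talk's implementation.

```python
from collections import defaultdict


def grams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Map each gram to the ascending list of IDs of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(grams(s, q)):
            index[g].append(sid)        # sids arrive in ascending order
    return index


def candidates(index, n_strings, query, ed, q=2):
    """Count filter: IDs sharing at least T distinct grams with the query."""
    qgrams = set(grams(query, q))
    T = len(qgrams) - ed * q            # each edit destroys at most q grams
    if T <= 0:                          # bound is useless: every string qualifies
        return list(range(n_strings))
    counts = defaultdict(int)
    for g in qgrams:
        for sid in index.get(g, []):
            counts[sid] += 1
    return sorted(sid for sid, c in counts.items() if c >= T)


strings = ["irvine", "irving", "marvin", "ervine", "shanghai"]  # made-up data
index = build_index(strings)
print(candidates(index, len(strings), "irvine", ed=1))  # [0, 1, 2, 3]
```

Here "marvin" (ID 2) shares the grams rv, vi, and in with the query, so it passes the count filter although its edit distance to "irvine" is greater than 1. It is a false positive that a post-processing similarity check would remove.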

8 T-Occurrence Problem
Find elements whose occurrences ≥ T
Figure: inverted lists in ascending order are merged
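One simple way to solve the T-occurrence problem is a heap-based merge of the sorted lists, counting the length of each run of equal IDs. The sketch below uses Python's standard `heapq.merge` and `itertools.groupby`; it is one straightforward option rather than necessarily the merging algorithm used in the talk, and it assumes each list is sorted and free of duplicates. The input lists are the five query lists from the “irvine” example.

```python
import heapq
from itertools import groupby


def t_occurrence(lists, T):
    """Return the IDs that occur on at least T of the ascending lists."""
    merged = heapq.merge(*lists)                 # one ascending stream of all IDs
    return [sid for sid, run in groupby(merged)
            if sum(1 for _ in run) >= T]         # run length = number of lists


query_lists = [[1, 3, 4, 5, 7, 9], [1, 5], [1, 2, 3, 9], [7, 9], [5, 6, 9]]
print(t_occurrence(query_lists, 3))              # [1, 5, 9]
```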

9 Motivation: Compression
Inverted Index >> Source Data
Fit in memory? Space Budget?

10 Motivation: Related Work
- IR: lossless compression of inverted lists (disk-based)
- Delta representation + compact encoding
- Inverted lists in memory: decompression overhead
- Tune compression ratio?
- Overcome these limitations in our setting?
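For readers unfamiliar with the IR approach, the following Python sketch shows delta representation followed by a compact encoding. Variable-byte coding is used here as a simple stand-in; it is an assumption for illustration and is not the Carryover-12 scheme compared against in the experiments. Function names are likewise illustrative.

```python
def encode(ids):
    """Delta-encode an ascending list of non-negative IDs, then
    variable-byte encode each gap (7 payload bits per byte)."""
    out = bytearray()
    prev = 0
    for sid in ids:
        gap = sid - prev
        prev = sid
        while gap >= 128:
            out.append(gap & 0x7F)       # low 7 bits, more bytes follow
            gap >>= 7
        out.append(gap | 0x80)           # high bit marks the last byte of a gap
    return bytes(out)


def decode(data):
    """Inverse of encode: rebuild the ID list from the byte string."""
    ids, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:                  # last byte of this gap
            gap |= (byte & 0x7F) << shift
            prev += gap
            ids.append(prev)
            gap, shift = 0, 0
        else:
            gap |= byte << shift
            shift += 7
    return ids


blob = encode([1, 3, 4, 5, 7, 9])
print(len(blob), decode(blob))           # 6 bytes instead of six full integers
```

The list has to be decoded before (or while) it is merged, which is the decompression overhead the slide refers to.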

11 Main Contributions
Two lossy compression techniques
- Answer queries exactly
- Index fits into a space budget
- Queries can be faster on the compressed indexes
- Flexibility to choose space / time tradeoff
- Existing list-merging algorithms: re-use + compression specific optimizations

12 Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion

13 Approach 1: Discarding Lists
2-grams: tf, vi, ir, ef, rv, ne, un, in, …
Inverted lists (stringIDs): {1,3,4,5,7,9}, {5,9}, {1,5}, {1,2,3,9}, {3,9}, {7,9}, {5,6,9}, {1,2,4,5,6}
Lists discarded, “Holes”

14 Effects on Queries
- Decrease lower bound T on common grams
- Smaller T -> more false positives
- T <= 0 -> “panic”, scan entire string collection
- Surprise: Fewer lists -> Faster Queries (depends)

15 Query “shanghai”, Edit Distance 1
3-grams of the query: {sha, han, ang, ngh, gha, hai}
3-grams in the index: sha, han, ang, ngh, gha, hai, ter, uni, ing, … (some are hole grams, the others regular grams)
Basis: Edit Operations “destroy” q=3 grams
No holes: T = #grams - ed * q = 6 - 1 * 3 = 3
With holes: T’ = T - #holes = 0 -> Panic!
Really destroy q=3 grams per edit operation? Dynamic Programming for tighter T
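The two formulas on this slide can be checked with a few lines of Python. This is a sketch of the simple bound only, not of the tighter dynamic-programming bound. The function name is illustrative, and the particular set of hole grams is a hypothetical choice, since the transcript does not say which three query grams are holes.

```python
def lower_bound(query, ed, q, holes=frozenset()):
    """Return (T, T_prime, panic) following the slide's formulas."""
    qgrams = [query[i:i + q] for i in range(len(query) - q + 1)]
    T = len(qgrams) - ed * q                       # each edit destroys <= q grams
    n_holes = sum(1 for g in qgrams if g in holes)
    T_prime = T - n_holes                          # hole grams cannot be counted
    return T, T_prime, T_prime <= 0


print(lower_bound("shanghai", ed=1, q=3))
# (3, 3, False)

hypothetical_holes = {"han", "ngh", "hai"}         # assumed for illustration
print(lower_bound("shanghai", ed=1, q=3, holes=hypothetical_holes))
# (3, 0, True)
```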

16 Choosing Lists to Discard
- Good choice depends on query workload
- Space budget: Many combinations of grams
- Make a “reasonable” choice efficiently?
Effect on Query: Unaffected, Panic, Slower or Faster

17 Choosing Lists to Discard
INPUT: Space Budget, Inverted lists, Workload
OUTPUT: Lists to discard
ALGORITHM: Greedy & Cost-Based
- Choose one list at a time
- For each candidate list (tf, vi, ir, ef, rv, ne, un, in, …), estimate the impact ∆t on the workload queries (Query1, Query2, Query3, …)
- Incrementally update the total estimated running time t

18 Estimating Query Times
- List-Merging: cost function, offline with linear regression
- Panic: #strings * avg similarity time
- Post-Processing: #candidates * avg similarity time
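The greedy, cost-based selection and the three time estimates can be sketched together in Python. Everything concrete here is an assumption made for illustration: the merge cost is modeled as proportional to the number of list entries read (the talk fits its cost function offline by regression), the constants `sim_time` and `merge_cost` are invented, the gram-to-list assignment in the toy index is hypothetical, and the sketch recomputes every estimate from scratch instead of updating it incrementally.

```python
def estimate_time(index, discarded, n_strings, qgrams, T,
                  sim_time=10.0, merge_cost=1.0):
    """Estimated time of one query under a hypothetical cost model."""
    T_prime = T - sum(1 for g in qgrams if g in discarded)
    if T_prime <= 0:                              # panic: scan all strings
        return n_strings * sim_time
    counts, merged = {}, 0
    for g in qgrams:
        if g in discarded:
            continue
        ids = index.get(g, [])
        merged += len(ids)                        # list-merging work
        for sid in ids:
            counts[sid] = counts.get(sid, 0) + 1
    n_candidates = sum(1 for c in counts.values() if c >= T_prime)
    return merge_cost * merged + n_candidates * sim_time   # merge + post-processing


def choose_lists_to_discard(index, n_strings, workload, budget):
    """Greedy: discard one list at a time, always the one that yields the
    lowest estimated total workload time, until the index fits the budget
    (measured in stored list entries)."""
    discarded = set()
    size = sum(len(ids) for ids in index.values())
    while size > budget:
        best_gram, best_time = None, None
        for g in index:
            if g in discarded:
                continue
            trial = discarded | {g}
            t = sum(estimate_time(index, trial, n_strings, qg, T)
                    for qg, T in workload)
            if best_time is None or t < best_time:
                best_gram, best_time = g, t
        if best_gram is None:                     # nothing left to discard
            break
        discarded.add(best_gram)
        size -= len(index[best_gram])
    return discarded


# Toy index; which gram owns which list is a made-up assignment.
index = {"tf": [5, 9], "vi": [1, 5], "ir": [1, 3, 4, 5, 7, 9], "ef": [3, 9],
         "rv": [1, 2, 3, 9], "ne": [7, 9], "un": [1, 2, 4, 5, 6], "in": [5, 6, 9]}
workload = [(["ir", "rv", "vi", "in", "ne"], 3)]   # (query grams, T)
discarded = choose_lists_to_discard(index, 10, workload, budget=15)
remaining = sum(len(ids) for g, ids in index.items() if g not in discarded)
print(sorted(discarded), remaining)
```

In this toy setup the first list chosen is the long list of "ir": dropping it lowers T to 2 but leaves the candidate set unchanged, so the estimated query time goes down. This mirrors the "fewer lists, faster queries" effect noted earlier in the talk.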

19 Estimating #candidates
Incremental-ScanCount Algorithm
List to discard: un -> {1, 3, 4}
BEFORE: T = 3, counts by stringID = [2, 3, 0, 1, 4, 0], #candidates = 2
Decrement the counts of the strings on the discarded list
AFTER: T’ = T-1 = 2, counts by stringID = [2, 2, 0, 0, 3, 0], #candidates = 3
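The numbers on this slide can be reproduced with a short Python sketch of the incremental update. It assumes string IDs 0 to 5 index the count array, which is what makes the decrements line up with the list {1, 3, 4}. Function names are illustrative.

```python
def count_candidates(counts, T):
    """Number of strings whose gram count reaches the threshold."""
    return sum(1 for c in counts if c >= T)


def discard_list(counts, T, discarded_ids):
    """Incremental update for a query that uses the discarded gram:
    decrement the counts of the strings on that list and lower T by one."""
    new_counts = list(counts)
    for sid in discarded_ids:
        new_counts[sid] -= 1
    return new_counts, T - 1


counts = [2, 3, 0, 1, 4, 0]                       # counts by stringID, BEFORE
print(count_candidates(counts, 3))                # 2

counts_after, T_after = discard_list(counts, 3, [1, 3, 4])
print(counts_after, T_after)                      # [2, 2, 0, 0, 3, 0] 2
print(count_candidates(counts_after, T_after))    # 3
```

Only the entries on the discarded list are touched, so the candidate estimate for a query is updated without rescanning its other lists.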

20 Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion

21 Approach 2: Combining Lists
2-grams: tf, vi, ir, ef, rv, ne, un, in, …
Inverted lists (stringIDs): {1,3,4,5,7,9}, {5,9}, {5,6,9}, {1,2,3,9}, {1,3,9}, {7,9}, {6,9}, {1,2,4,5,6}
Lists combined

22 Effects on Queries
- Lower bound T is unchanged (no new panics)
- Lists become longer:
  - More time to traverse lists
  - More false positives

23 Speeding Up Queries
Query 3-grams: {sha, han, ang, ngh, gha, hai}
Figure: one combined list is shared by two of the query grams (refcount = 2), another by three (refcount = 3)
Traverse physical lists once. Count for stringIDs increases by refcount.
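The refcount idea can be sketched in Python as follows. The mapping of grams to physical lists and the list contents are hypothetical, chosen only so that one list has refcount 2 and another refcount 3. Function and variable names are illustrative.

```python
from collections import Counter


def scan_count_combined(gram_to_list, lists, qgrams):
    """ScanCount over combined lists: every physical list is traversed once,
    and each string ID on it is credited with the list's refcount."""
    refcount = Counter(gram_to_list[g] for g in qgrams if g in gram_to_list)
    counts = Counter()
    for list_id, ref in refcount.items():
        for sid in lists[list_id]:
            counts[sid] += ref
    return counts


# Hypothetical layout: sha+han share list 0, ang+ngh+gha share list 1.
gram_to_list = {"sha": 0, "han": 0, "ang": 1, "ngh": 1, "gha": 1, "hai": 2}
lists = {0: [1, 4], 1: [2, 4], 2: [4, 7]}
qgrams = ["sha", "han", "ang", "ngh", "gha", "hai"]
print(dict(scan_count_combined(gram_to_list, lists, qgrams)))
# {1: 2, 4: 6, 2: 3, 7: 1}
```

Three physical lists are read instead of six, and the resulting counts are the same as if each gram's (combined) list had been scanned separately.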

24 Choosing Lists to Combine
- Discovering candidate gram pairs
  - Frequent q+1-grams -> correlated adjacent q-grams
  - Locality-Sensitive Hashing (LSH)
- Selecting candidate pairs to combine
  - Basis: estimated cost on query workload
  - Similar to DiscardLists
  - Different Incremental ScanCount algorithm
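The first discovery technique can be illustrated with a small Python sketch: if a (q+1)-gram is frequent, the two q-grams it contains appear together in many strings, so their inverted lists overlap heavily. The frequency threshold `min_count`, the example strings, and the function name are assumptions for illustration. LSH and the cost-based selection step are not shown.

```python
from collections import Counter


def candidate_pairs(strings, q=2, min_count=2):
    """Pairs of adjacent q-grams that sit inside a (q+1)-gram occurring
    in at least min_count strings."""
    freq = Counter()
    for s in strings:
        # count each (q+1)-gram once per string
        freq.update({s[i:i + q + 1] for i in range(len(s) - q)})
    return sorted((g[:q], g[1:]) for g, n in freq.items() if n >= min_count)


strings = ["irvine", "irving", "marvin"]          # made-up data
print(candidate_pairs(strings))
# [('ir', 'rv'), ('rv', 'vi'), ('vi', 'in')]
```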

25 Overview
- Motivation & Preliminaries
- Approach 1: Discarding Lists
- Approach 2: Combining Lists
- Experiments & Conclusion

26 Experiments
Datasets:
- Google WebCorpus Word Grams
- IMDB Actors
- DBLP Titles
Overview:
- Performance & Scalability of DiscardLists & CombineLists
- Comparison with IR compression & VGRAM
- Changing workloads
10k Queries: Zipf distributed, from dataset
q=3, Edit Distance=2, (also Jaccard & Cosine)

27 Experiments
Figure panels: DiscardLists, CombineLists
Runtime decreases!

28 Comparison with IR compression
Carryover-12
Figure panels: Uncompressed, Compressed

29 Comparison with variable-length grams, VGRAM
Figure panels: Uncompressed, Compressed

30 Future Work
- Combine: DiscardLists, CombineLists and IR compression
- Filters for partitioning, global vs. local decisions
- Dealing with updates to index

31 Conclusions
Two lossy compression techniques
- Answer queries exactly
- Index fits into a space budget
- Queries can be faster on the compressed indexes
- Flexibility to choose space / time tradeoff
- Existing list-merging algorithms: re-use + compression specific optimizations

32 Thank You!
This work is part of The Flamingo Project
http://flamingo.ics.uci.edu

33 More Experiments
What if the workload changes from the training workload?

34 More Experiments
What if the workload changes from the training workload?
