Chen Li (李晨): Search As You Type. Joint work with colleagues at UCI and Tsinghua.

1 Search As You Type. Chen Li (李晨). Joint work with colleagues at UCI and Tsinghua.

2 Demos
- http://www.cs.stanford.edu/ “Search” box: try “garcia molina”, then try “garcia monila”
- http://directory.uci.edu/: try “venkatasubramanian”
- http://psearch.ics.uci.edu/
- http://fr.ics.uci.edu/haiti/
- http://www.miamiherald.com/news/americas/haiti/connect/
- http://ipubmed.ics.uci.edu/

3 Traditional Keyword Search: Too many results! No result! Complicated and still no result!

4 Interactive Fuzzy Keyword Search

5 What’s new? Search on apple.com Query: “itune” Missing result! Query: “itunes music”

6 Challenge: performance!
- < 100 ms: server processing, network, JavaScript, etc.
- Requirement for high query throughput
  - 20 queries per second (QPS) → 50 ms/query (at most)
  - 100 QPS → 10 ms/query
- Other challenges: ranking, space requirements, …
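
The per-query figures on this slide follow from dividing one second by the target throughput. A minimal sketch of that arithmetic, assuming a single worker that handles queries one after another (the function name is made up for illustration):

```python
def per_query_budget_ms(qps: float) -> float:
    """Time available per query in milliseconds.

    Assumption (not stated on the slide): one worker processes
    queries strictly one at a time.
    """
    return 1000.0 / qps


for qps in (20, 100):
    print(f"{qps} QPS -> at most {per_query_budget_ms(qps):.0f} ms/query")
```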

7 Two Features (Focus of this talk)
- Fuzzy search: finding results with approximate keywords
- Full-text search: finding results that contain the query keywords (not necessarily adjacently)

8 Edit Distance
- ed(s1, s2) = minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2
- Example:
  s1: v e n k a t s u b r a m a n i a n
  s2: w e n k a t s u b r a m a n i a n
  ed(s1, s2) = 1
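
The definition above is the standard Levenshtein distance. A minimal sketch of the usual dynamic-programming computation follows; the function name `edit_distance` is chosen here for illustration and is not taken from the talk.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn s1 into s2 (row-by-row dynamic programming)."""
    # prev[j] holds the distance between the processed part of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]  # distance from s1[:i] to the empty string
        for j, b in enumerate(s2, 1):
            cur.append(min(
                prev[j] + 1,             # delete a from s1
                cur[j - 1] + 1,          # insert b into s1
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = cur
    return prev[-1]


# The pair from the slide differs only in its first letter.
print(edit_distance("venkatsubramanian", "wenkatsubramanian"))
```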

9 Problem Setting
- Data
  - R: a set of records
  - W: a set of distinct words
- Query
  - Q = {p_1, p_2, …, p_l}: a set of prefixes
  - δ: edit-distance threshold
- Query result
  - R_Q: the set of records such that each record has all query prefixes or their similar forms
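
To make the query semantics concrete, here is a brute-force reference sketch: a record is in R_Q when every query prefix is within edit distance δ of some prefix of some word in the record. It is not the indexed method the talk develops, only a slow baseline for checking results. The sample records (and the second word in each) are made up for illustration.

```python
def prefix_edit_distance(p: str, word: str) -> int:
    """Smallest edit distance between p and any prefix of word."""
    prev = list(range(len(word) + 1))  # distance from "" to word[:j]
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(word, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    # prev[j] is now ed(p, word[:j]); take the best prefix of word.
    return min(prev)


def search(records: dict, prefixes: list, delta: int) -> list:
    """Return IDs of records in which every query prefix has a similar form."""
    result = []
    for rid, text in records.items():
        words = text.lower().split()
        if all(
            any(prefix_edit_distance(p.lower(), w) <= delta for w in words)
            for p in prefixes
        ):
            result.append(rid)
    return result


# Hypothetical records, not from the talk.
RECORDS = {
    1: "venkatasubramanian databases",
    2: "carey databases",
    3: "smith networks",
}
print(search(RECORDS, ["wenkat", "dat"], delta=1))
```

Scanning every word of every record costs far more than the budget on slide 6 allows for large data, which is the motivation for the trie-based approach on the following slides.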

10 Feature 1: Fuzzy Search

11 Formulation
- Record strings: venkatasubramanian, carey, jain, nicolau, smith
- Query: wenkatsubra
- Find strings with a prefix similar to a query keyword
- Do it incrementally!

12 Observation
- Strings = {exam, example, exemplar, exempt, sample}
- Edit-distance threshold δ = 2

Q' = exampl
Prefix    Distance
exam      2
examp     1
exampl    0
example   1
exemp     2
exempt    2
exempl    1
exempla   2
sampl     2

Q = example
Prefix    Distance
examp     2
exampl    1
example   0
exempl    2
exempla   2
sample    2

Arrows from the Q' table to the Q table are labeled: delete e; match e; delete e; replace e with a; match e.
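
The two tables above can be reproduced by brute force: enumerate every prefix of every string and keep those within δ of the query. This sketch is only a check on the tables (the function names are chosen here), not the incremental method the talk goes on to describe.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def similar_prefixes(strings, query: str, delta: int) -> dict:
    """Map each prefix of the strings to its distance from query, if <= delta."""
    prefixes = {s[:i] for s in strings for i in range(len(s) + 1)}
    table = {}
    for p in prefixes:
        d = edit_distance(query, p)
        if d <= delta:
            table[p] = d
    return table


STRINGS = ["exam", "example", "exemplar", "exempt", "sample"]
for q in ("exampl", "example"):
    print(q, sorted(similar_prefixes(STRINGS, q, delta=2).items()))
```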

13 Trie Indexing
- Computing the set of active nodes Φ_Q
  - Initialization
  - Incremental step

[Figure: trie over the strings exam, example, exemplar, exempt, sample; "$" marks the end of a string. Active nodes for Q = example are labeled with their distances.]

Active nodes for Q = example
Prefix    Distance
examp     2
exampl    1
example   0
exempl    2
exempla   2
sample    2

14 Initialization
- Q = ε
- Initializing Φ_ε with all nodes within a depth of δ

[Figure: the same trie, with the nodes within depth δ labeled with their distances.]

Prefix    Distance
ε         0
e         1
ex        2
s         1
sa        2
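
The initialization step can be sketched as follows: build a trie over the strings and collect every node whose depth is at most δ, using the depth as its distance from the empty query. The class and function names are illustrative, not from the talk; `is_word` plays the role of the "$" end marker in the figure.

```python
class TrieNode:
    def __init__(self, prefix: str = ""):
        self.prefix = prefix   # string spelled out from the root to this node
        self.children = {}     # next character -> TrieNode
        self.is_word = False   # True where the slide's trie has a "$" child


def build_trie(words) -> TrieNode:
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            if ch not in node.children:
                node.children[ch] = TrieNode(node.prefix + ch)
            node = node.children[ch]
        node.is_word = True
    return root


def initial_active_nodes(root: TrieNode, delta: int) -> dict:
    """Active nodes for the empty query: all nodes within depth delta.

    A node at depth k is reached from the empty string by k insertions,
    so its distance is k.
    """
    active = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        active[node.prefix] = depth
        if depth < delta:
            for child in node.children.values():
                stack.append((child, depth + 1))
    return active


root = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
print(sorted(initial_active_nodes(root, delta=2).items()))
```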

15 Incremental Algorithm: Overview Access their leaf nodes as answers.

16 Incremental Computation: Example
- Q = e

Active nodes for Q = ε
Prefix    Distance
ε         0
e         1
ex        2
s         1
sa        2

Entries generated for Q = e (Base is the active node for Q = ε that the entry is derived from)
Prefix    # Op    Base    Op
ε         1       ε       del e
s         1       ε       sub e/s
e         0       ε       mat e
ex        1       ε       ins x
exa       2       ε       ins xa
exe       2       ε       ins xe
e         2       e       del e
ex        2       e       sub e/x
ex        3       ex      del e
exa       3       ex      sub e/a
exe       2       ex      mat e
s         2       s       del e
sa        2       s       sub e/a
sa        3       sa      del e

Active nodes for Q = e (as labeled on the trie): ε 1, e 0, ex 1, exa 2, exe 2, s 1, sa 2

17 Incremental Computation: Algorithm
- Incremental computation from Φ_Q' to Φ_Q
- add(Φ_Q, <n, d>) has an effect only if there exists no active node in Φ_Q with the same n and a smaller d

Algorithm details:
FOR EACH <n, d> FROM Φ_Q'
  Deletion:      add(Φ_Q, <n, d+1>)
  Substitution:  FOR EACH n' FROM non-matching children of n
                   add(Φ_Q, <n', d+1>)
  Match:         add(Φ_Q, <m, d>)  (m is the matching child of n)
  Insertion:     FOR EACH m' FROM descendants of m
                   add(Φ_Q, <m', d+x>)  (x is the distance from m' to m)
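
The four cases above can be sketched in Python as below. This is an illustrative reading of the slide, with two assumptions made explicit: entries whose distance exceeds δ are dropped, and insertions only descend δ - d levels below the matching child. The trie is represented simply as a dictionary from each prefix to its child prefixes; all names are chosen here.

```python
def build_children(words) -> dict:
    """Trie as a dict: prefix -> sorted list of child prefixes."""
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}
    children = {p: [] for p in prefixes}
    for p in prefixes:
        if p:
            children[p[:-1]].append(p)
    for kids in children.values():
        kids.sort()
    return children


def add(active: dict, node: str, dist: int, delta: int) -> None:
    """Keep <node, dist> only if within delta and smaller than what is stored."""
    if dist <= delta and dist < active.get(node, delta + 1):
        active[node] = dist


def descendants(children: dict, node: str, max_depth: int):
    """Yield (descendant, depth below node) for depths 1..max_depth."""
    frontier = [(node, 0)]
    while frontier:
        cur, depth = frontier.pop()
        if depth == max_depth:
            continue
        for child in children[cur]:
            yield child, depth + 1
            frontier.append((child, depth + 1))


def step(children: dict, active: dict, c: str, delta: int) -> dict:
    """Compute the active nodes of Q = Q' + c from those of Q'."""
    new = {}
    for n, d in active.items():
        add(new, n, d + 1, delta)                    # deletion of c
        for child in children[n]:
            if child[-1] == c:
                add(new, child, d, delta)            # match
                for m2, x in descendants(children, child, delta - d):
                    add(new, m2, d + x, delta)       # insertions below the match
            else:
                add(new, child, d + 1, delta)        # substitution
    return new


def active_nodes(children: dict, query: str, delta: int) -> dict:
    # Initialization: all nodes within depth delta (slide 14).
    active = {p: len(p) for p in children if len(p) <= delta}
    for c in query:
        active = step(children, active, c, delta)
    return active


children = build_children(["exam", "example", "exemplar", "exempt", "sample"])
print(sorted(active_nodes(children, "e", 2).items()))
print(sorted(active_nodes(children, "example", 2).items()))
```

Each keystroke touches only the current active nodes and their nearby descendants, which is why the per-keystroke cost stays small when δ is small.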

18 Feature 2: Full-text search
- Find answers with query keywords
- Not necessarily adjacently

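19 Multi-Prefix Intersection
- Q = vldb li

ID    Record
1     Li data…
2     data…
3     data Lin…
4     Lu Lin Luis…
5     Liu…
6     VLDB Lin data…
7     VLDB…
8     Li VLDB…

[Figure: trie over the keywords, with an inverted list of record IDs at each keyword's "$" leaf.]

Keyword    Inverted list
data       1 2 3 6
li         1 8
lin        3 4 6
liu        5
lu         4
luis       4
vldb       6 7 8

Answers: 6 (VLDB Lin data…), 8 (Li VLDB…)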

20 Multi-Prefix Intersection: Method 1
- Q = vldb li (records and trie as on slide 19)
- li: union of the inverted lists under the prefix = 1 3 4 5 6 8
- vldb: 6 7 8
- Intersection: 6 8
- Space cost: inverted index
- Time cost: union + intersection
- More efficient intersection approaches…
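
A minimal sketch of Method 1 on the slide's records: union the inverted lists of every keyword that starts with a query prefix, then intersect the unions. Only the words visible on the slide are used (the records are elided with "…" there), and the keywords for a prefix are found by a plain scan here rather than through the trie, which is a simplification made for brevity.

```python
from collections import defaultdict

# Visible words of the slide's records; the elided remainder is unknown.
RECORDS = {
    1: "Li data", 2: "data", 3: "data Lin", 4: "Lu Lin Luis",
    5: "Liu", 6: "VLDB Lin data", 7: "VLDB", 8: "Li VLDB",
}


def build_inverted_index(records: dict) -> dict:
    """keyword -> set of IDs of the records containing it."""
    index = defaultdict(set)
    for rid, text in records.items():
        for word in text.lower().split():
            index[word].add(rid)
    return index


def prefix_union(index: dict, prefix: str) -> set:
    """Union of the inverted lists of all keywords starting with prefix."""
    ids = set()
    for word, posting in index.items():
        if word.startswith(prefix):
            ids |= posting
    return ids


def search_method1(index: dict, prefixes: list) -> list:
    unions = [prefix_union(index, p.lower()) for p in prefixes]
    return sorted(set.intersection(*unions)) if unions else []


index = build_inverted_index(RECORDS)
print(sorted(prefix_union(index, "li")))
print(sorted(prefix_union(index, "vldb")))
print(search_method1(index, ["vldb", "li"]))
```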

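21 Multi-Prefix Intersection: Method 2
- Q = vldb li (records and trie as on slide 19)
- Keyword IDs (leaf order): data = 1, li = 2, lin = 3, liu = 4, lu = 5, luis = 6, vldb = 7
- Keyword-ID range of each trie node: root [1, 7]; data [1, 1]; l [2, 6]; li [2, 4]; lin [3, 3]; liu [4, 4]; lu [5, 6]; luis [6, 6]; vldb [7, 7]

ID    Record            Forward list
1     Li data…          1 2
2     data…             1
3     data Lin…         1 3
4     Lu Lin Luis…      3 5 6
5     Liu…              4
6     VLDB Lin data…    1 3 7
7     VLDB…             7
8     Li VLDB…          2 7

- Read each record on the inverted list of vldb (6 7 8), then verify/probe its forward list for a keyword ID in the range [2, 4] of li
  - 6 (VLDB Lin data…): forward list 1 3 7, match
  - 8 (Li VLDB…): forward list 2 7, match
- Space cost: inverted + forward index
- Time cost: probing forward lists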
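
Method 2 can be sketched as follows: number the keywords in sorted order, store each record's sorted keyword IDs as its forward list, and turn each query prefix into an ID range. The records on one prefix's inverted lists are then read, and each is probed by binary search for an ID inside the other prefixes' ranges. Reading the first query prefix's lists is an assumption that mirrors the slide, where vldb is read and li is probed; all names are illustrative.

```python
import bisect

# Visible words of the slide's records; the elided remainder is unknown.
RECORDS = {
    1: "Li data", 2: "data", 3: "data Lin", 4: "Lu Lin Luis",
    5: "Liu", 6: "VLDB Lin data", 7: "VLDB", 8: "Li VLDB",
}


def build_indexes(records: dict):
    vocab = sorted({w for t in records.values() for w in t.lower().split()})
    word_id = {w: i + 1 for i, w in enumerate(vocab)}  # 1-based, sorted order
    forward = {
        rid: sorted({word_id[w] for w in t.lower().split()})
        for rid, t in records.items()
    }
    inverted = {}
    for rid in sorted(forward):
        for wid in forward[rid]:
            inverted.setdefault(wid, []).append(rid)
    return vocab, forward, inverted


def prefix_range(vocab: list, prefix: str):
    """Inclusive 1-based keyword-ID range covered by prefix, or None."""
    lo = bisect.bisect_left(vocab, prefix)
    hi = lo
    while hi < len(vocab) and vocab[hi].startswith(prefix):
        hi += 1
    return (lo + 1, hi) if hi > lo else None


def has_id_in_range(forward_list: list, lo: int, hi: int) -> bool:
    """Binary-search probe: does the sorted list contain an ID in [lo, hi]?"""
    i = bisect.bisect_left(forward_list, lo)
    return i < len(forward_list) and forward_list[i] <= hi


def search_method2(vocab, forward, inverted, prefixes: list) -> list:
    ranges = [prefix_range(vocab, p.lower()) for p in prefixes]
    if not ranges or any(r is None for r in ranges):
        return []
    first, rest = ranges[0], ranges[1:]
    candidates = set()
    for wid in range(first[0], first[1] + 1):  # read the first prefix's lists
        candidates.update(inverted.get(wid, []))
    return [
        rid for rid in sorted(candidates)
        if all(has_id_in_range(forward[rid], lo, hi) for lo, hi in rest)
    ]


vocab, forward, inverted = build_indexes(RECORDS)
print(prefix_range(vocab, "li"), forward[6], forward[8])
print(search_method2(vocab, forward, inverted, ["vldb", "li"]))
```

Compared with Method 1, no union over the li lists is materialized; the cost moves to one probe per candidate record, at the price of storing the forward index.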

22 Traversing inverted lists incrementally
- Compute and cache only the needed answers
- For subsequent queries, compute the answers:
  - from the cached answers
  - by resuming the previously terminated computation
- Example:
  - Q = cs co: the traversal list is the inverted list of cs; compute the cached answers of cs co
  - Q = cs conf: verify the cached answers of cs co, then compute the remaining cached answers of cs conf

23 Experimental Results  Computing similar prefixes

24 Multi-prefix intersection

25 Time Scalability

26 Index scalability

27 Other Features
- Synonyms
- Ranking

28 Conclusions
- New data-access paradigm: Search as you type
- Many interesting and challenging problems
- http://tastier.ics.uci.edu/

