1 An Efficient Index Structure for String Databases Tamer Kahveci Ambuj K. Singh Department of Computer Science University of California Santa Barbara.

1 An Efficient Index Structure for String Databases. Tamer Kahveci, Ambuj K. Singh. Department of Computer Science, University of California, Santa Barbara. http://www.cs.ucsb.edu/~tamer

2 Whole/Substring Matching Problem: Find substrings in a database that are similar to a given query string, quickly, using a small index structure (1-2% of the database size). [Figure: a query string matched against a database string.]

3 String Similarity Motivation. Applications: genetic sequence databases (NCBI); text databases, spell checkers, web search; video databases (e.g. VIRAGE, MEDIA360). Database size is too large. Most of the available techniques are in-memory. The space requirement of current indexes is too large. [Plot: base pairs (millions) by year.]

4 Outline: Motivation & background; Our contribution; Frequency vector, frequency distance & wavelet transform; Multi-resolution index structure; k-NN & range queries; Experimental results; Conclusion.

5 Notation: q: query string. m, n: lengths of strings. r: range query radius. ε = r/|q|: error rate.

6 String Similarity: an example (R = replace, I = insert, D = delete)
A C T - - T A G C
  R   I I       D
A A T G A T A G -

7 Background: Edit operations: insert, delete, replace. Edit distance (ED) between s_1 and s_2 = minimum number of edit operations needed to transform s_1 into s_2. Finding the edit distance is costly: O(mn) time and space with dynamic programming, where m and n are the lengths of s_1 and s_2 [NW70, SW81].
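The O(mn) dynamic program mentioned on this slide can be sketched as follows. This is a minimal textbook version with unit costs for insert, delete and replace; the function name `edit_distance` is an illustrative choice, not something taken from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insert/delete/replace operations turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dist[i][j] holds the edit distance between s1[:i] and s2[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete
                dist[i][j - 1] + 1,         # insert
                dist[i - 1][j - 1] + cost,  # match or replace
            )
    return dist[m][n]


# The pair of strings used as the running example in the slides.
print(edit_distance("ACTTAGC", "AATGATAG"))  # 4
```

The table has (m+1)(n+1) cells, which is where the O(mn) time and space bound comes from.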

8 Related Work
Lossless search
  Online
    [Mye86] (Myers): reduces the space requirement to O(rn), where r is the query radius.
    [WM92] (Wu, Manber): binary masks, O(rn).
    [BYN99] (Baeza-Yates, Navarro): NFA.
  Offline (index based)
    [Mye94] (Myers): condensed r-neighborhood.
    [BYN97] (Baeza-Yates, Navarro): dictionary.
Lossy search
  [AG90] (Altschul, Gish): BLAST.
    FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPUTER, MUMmer.
  [GWWV00] (Giladi, Walker, Wang, Volkmuth): SST-Tree.

9 Outline: Motivation & background; Our contribution; Frequency vector, frequency distance & wavelet transform; Multi-resolution index structure; k-NN & range queries; Experimental results; Conclusion.

10 Frequency Vector: Let s be a string over the alphabet Σ = {α_1, ..., α_σ}. Let n_i be the number of occurrences of the character α_i in s, for 1 ≤ i ≤ σ. Then the frequency vector is f(s) = [n_1, ..., n_σ]. Example: s = AATGATAG, f(s) = [n_A, n_C, n_G, n_T] = [4, 0, 2, 2].

11 Effect of Edit Operations on Frequency Vector
Delete: decreases an entry by 1. Insert: increases an entry by 1. Replace: insert + delete (one entry decreases by 1 and another increases by 1).
Example:
s = AATGATAG => f(s) = [4, 0, 2, 2]
(del. G) s = AAT.ATAG => f(s) = [4, 0, 1, 2]
(ins. C) s = AACTATAG => f(s) = [4, 1, 1, 2]
(A → C) s = ACCTATAG => f(s) = [3, 2, 1, 2]
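As a small illustration of the frequency vector and of how each edit operation moves it, the sketch below replays the example from this slide. The helper name `frequency_vector` and the fixed DNA alphabet order A, C, G, T are assumptions made for the example.

```python
def frequency_vector(s, alphabet="ACGT"):
    """Count the occurrences of each alphabet character in s."""
    return [s.count(ch) for ch in alphabet]


steps = [
    ("original", "AATGATAG"),
    ("delete G", "AATATAG"),
    ("insert C", "AACTATAG"),
    ("replace A by C", "ACCTATAG"),
]
for label, s in steps:
    # Each step changes at most two entries of the vector, each by 1.
    print(f"{label:15s} {s:9s} {frequency_vector(s)}")
```

Each printed vector differs from the previous one by a single unit step (delete, insert) or by one decrease plus one increase (replace).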

12 An Approximation to ED: Frequency Distance (FD_1)
s = AATGATAG => f(s) = [4, 0, 2, 2]
q = ACTTAGC => f(q) = [2, 2, 1, 2]
pos = (4-2) + (2-1) = 3
neg = (2-0) = 2
FD_1(f(s), f(q)) = 3, ED(q, s) = 4
FD_1(f(s_1), f(s_2)) = max{pos, neg}.
FD_1(f(s_1), f(s_2)) ≤ ED(s_1, s_2).
[Figure: points f(q) and f(s), separated by FD_1(f(q), f(s)).]
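The frequency distance on this slide can be computed directly from the two frequency vectors. The sketch below follows the max{pos, neg} definition; the function names and the `may_be_within` filter are illustrative assumptions showing how the lower bound can discard candidates without computing the edit distance.

```python
def frequency_vector(s, alphabet="ACGT"):
    """Count the occurrences of each alphabet character in s."""
    return [s.count(ch) for ch in alphabet]


def frequency_distance(u, v):
    """FD_1: max of the total surplus (pos) and total deficit (neg) of u over v."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if a < b)
    return max(pos, neg)


def may_be_within(s, q, r):
    """False means ED(s, q) > r is guaranteed, because FD_1 never exceeds ED."""
    return frequency_distance(frequency_vector(s), frequency_vector(q)) <= r


s, q = "AATGATAG", "ACTTAGC"
print(frequency_distance(frequency_vector(s), frequency_vector(q)))  # 3
print(may_be_within(s, q, 2))  # False: s cannot be within edit distance 2 of q
```

Because the filter only uses a lower bound, a `True` answer still has to be verified with the real edit distance.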

13 An Illustration of Frequency Distance & Edit Distance. [Figure labels: set of strings 1, set of strings 2, vectors v_1 and v_2, frequency distance, edit distance.]

14 Using Local Information: Wavelet Decomposition of Strings
s = AATGATAC => f(s) = [4, 1, 1, 2]
s = AATG + ATAC = s_1 + s_2
f(s_1) = [2, 0, 1, 1]
f(s_2) = [2, 1, 0, 1]
ψ_1(s) = f(s_1) + f(s_2) = [4, 1, 1, 2]
ψ_2(s) = f(s_1) - f(s_2) = [0, -1, 1, 0]

15 Wavelet Decomposition of a String: General Idea
A_{i,j} = f(s(j·2^i : (j+1)·2^i - 1))
B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}
ψ(s) = [first wavelet coefficient, second wavelet coefficient]
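The recurrences for A and B can be turned into a short bottom-up computation. The sketch below assumes the string length is a power of two and a DNA alphabet; the function name `wavelet_levels` and the list-of-levels layout are illustrative choices, not the authors' implementation.

```python
def frequency_vector(s, alphabet="ACGT"):
    """Count the occurrences of each alphabet character in s."""
    return [s.count(ch) for ch in alphabet]


def wavelet_levels(s, alphabet="ACGT"):
    """Return (A, B): A[i][j] is the frequency vector of the j-th block of
    length 2**i, and B[i][j] = A[i-1][2j] - A[i-1][2j+1] (B[0] is empty)."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("string length must be a power of two")
    A = [[frequency_vector(ch, alphabet) for ch in s]]  # level 0: single characters
    B = [[]]
    while len(A[-1]) > 1:
        prev = A[-1]
        pairs = [(prev[2 * j], prev[2 * j + 1]) for j in range(len(prev) // 2)]
        A.append([[x + y for x, y in zip(left, right)] for left, right in pairs])
        B.append([[x - y for x, y in zip(left, right)] for left, right in pairs])
    return A, B


A, B = wavelet_levels("AATGATAC")
print(A[-1][0])  # [4, 1, 1, 2], the sum of the two half-string vectors
print(B[-1][0])  # [0, -1, 1, 0], their difference
```

The top-level entries reproduce the two coefficients of the AATGATAC example from the preceding slide.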

16 Wavelet Decomposition & ED: Define FD(s_1, s_2) = max{FD_1, FD_2}.

17 Outline: Motivation & background; Our contribution; Frequency vector, frequency distance & wavelet transform; Multi-resolution index structure; k-NN and range queries; Experimental results; Conclusion.

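18 MRS-Index Structure Creation. [Figure: a window of size w = 2^a over string s_1 is transformed.]

19 MRS-Index Structure Creation. [Figure: string s_1.]

20 MRS-Index Structure Creation. [Figure: string s_1.]

21 MRS-Index Structure Creation. [Figure: the window slides c times over s_1, where c = box capacity.]

22 MRS-Index Structure Creation. [Figure: string s_1, continued.]

23 MRS-Index Structure Creation. [Figure: row T_{a,1} for string s_1 at w = 2^a.]

24 Using Different Resolutions. [Figure: rows T_{a,1} at w = 2^a and T_{a+1,1} at w = 2^{a+1} for string s_1.]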
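The construction sketched on these slides (slide a window of size w over the string, and close a box after every c window positions) might look roughly like the code below. It is a simplified sketch under stated assumptions: it stores only the frequency vector of each window rather than the full wavelet transform, and the names `build_row` and `build_index` are hypothetical.

```python
def build_row(s, w, c, alphabet="ACGT"):
    """Return a list of MBRs (lo, hi), each covering the frequency vectors
    of up to c consecutive length-w windows of s."""
    if len(s) < w:
        return []
    col = {ch: k for k, ch in enumerate(alphabet)}
    freq = [0] * len(alphabet)
    for ch in s[:w]:
        freq[col[ch]] += 1
    boxes = []
    lo, hi, count = freq[:], freq[:], 1
    for start in range(1, len(s) - w + 1):
        # Slide the window by one: drop the leftmost character, add the new one.
        freq[col[s[start - 1]]] -= 1
        freq[col[s[start + w - 1]]] += 1
        if count == c:
            boxes.append((lo, hi))  # the box is full, start a new one
            lo, hi, count = freq[:], freq[:], 1
        else:
            lo = [min(a, b) for a, b in zip(lo, freq)]
            hi = [max(a, b) for a, b in zip(hi, freq)]
            count += 1
    boxes.append((lo, hi))
    return boxes


def build_index(s, a_min, a_max, c):
    """One row of boxes per resolution w = 2**a, for a_min <= a <= a_max."""
    return {2 ** a: build_row(s, 2 ** a, c) for a in range(a_min, a_max + 1)}


print(build_row("AATGATAG", w=4, c=2))
```

Storing only the lower and upper corner of each box, instead of every window's vector, is what keeps the index small relative to the database.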

25 MRS-Index Structure

26 MRS-Index Properties: Relative MBR volume (precision) decreases when c increases and when w decreases. MBRs are highly clustered. [Plot: box volume vs. box capacity.]

27 Outline: Motivation & background; Our contribution; Frequency vector, frequency distance & wavelet transform; Multi-resolution index structure; k-NN & range queries; Experimental results; Conclusion.

28 Range Queries [KS01]. [Figure: a query of length 208 = 16 + 64 + 128 is searched against the index rows for w = 2^4, 2^5, 2^6, 2^7 of strings s_1, s_2, ..., s_d.]

29 k-Nearest Neighbor Query [KSF+96, SK98]. k = 3.

30 k-Nearest Neighbor Query. k = 3; r = edit distance to the 3rd closest substring.

31 k-Nearest Neighbor Query. k = 3, radius r.

32 k-Nearest Neighbor Query. k = 3.

33 Outline: Motivation & background; Our contribution; Experimental results; Conclusion.

34 Experimental Settings: w = {128, 256, 512, 1024}. Human chromosomes from www.ncbi.nlm.nih.gov: chr02, chr18, chr21, chr22. Plotted results are from the chr18 dataset. Queries are selected randomly from the data set, with 512 ≤ |q| ≤ 10000. An NFA-based technique [BYN99] is implemented for comparison.

35 Experimental Results 1: Effect of Box Capacity (10-NN)

36 Experimental Results 2: Effect of Window Size (10-NN)

37 Experimental Results 3: k-NN Queries

38 Experimental Results 4: Range Queries

39 Outline: Motivation & background; Our contribution; Experimental results; Discussion & conclusion.

40 Discussion: In-memory (index size is 1-2% of the database size). Lossless search. 3 to 45 times faster than the NFA technique for k-NN queries. 2 to 12 times faster than the NFA technique for range queries. Can be used to speed up any previously defined technique.

41 Future Work: Extend to weighted edit distance and affine gaps. Extend to local similarity (substring/substring) search. Compare the quality of answers and speed to BLAST (lossy search). Use as a preprocessing step to BLAST. Apply the MRS index structure to larger alphabets (e.g. protein sequences).

42 Related Work
Lossless search
  Online
    [Mye86] (Myers): reduces the space requirement to O(rn), where r is the query radius.
    [WM92] (Wu, Manber): binary masks, O(rn).
    [BYN99] (Baeza-Yates, Navarro): NFA.
  Offline (index based)
    [Mye94] (Myers): condensed r-neighborhood.
    [BYN97] (Baeza-Yates, Navarro): dictionary.
Lossy search
  [AG90] (Altschul, Gish): BLAST.
    FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPUTER, MUMmer.
  [GWWV00] (Giladi, Walker, Wang, Volkmuth): SST-Tree.

43 Related Work (Similar Problems): [BYP92] (Baeza-Yates, Perleberg): only replace is allowed. [Gus97] (Gusfield): exact matching, suffix trees. [JKS00] (Jagadish, Koudas, Srivastava): exact matching with wild-cards for multidimensional strings, elided trees and R-tree.

44 Thank you

45 Frequency Distance to an MBR. [Figure: FD(f(q), f(s)) between the points f(q) and f(s), and FD(f(q), B) between f(q) and an MBR B.]
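One way to bound the frequency distance from a query vector to a box is to clamp the query into the box and measure the surplus and deficit that remain. The sketch below is an assumption-laden illustration, not necessarily the computation shown on the original slide: it treats the box as an unconstrained axis-aligned range [lo, hi], ignores any constraint on the sum of the entries, and uses the hypothetical name `frequency_distance_to_box`.

```python
def frequency_distance_to_box(q, lo, hi):
    """Smallest max{pos, neg} distance from vector q to any point of the
    axis-aligned box with lower corner lo and upper corner hi."""
    # Surplus of q above the box, and deficit of q below the box, per dimension.
    pos = sum(max(0, a - h) for a, h in zip(q, hi))
    neg = sum(max(0, l - a) for a, l in zip(q, lo))
    return max(pos, neg)


q = [2, 2, 1, 2]  # f(ACTTAGC)
print(frequency_distance_to_box(q, lo=[1, 0, 1, 1], hi=[2, 0, 1, 2]))  # 2
print(frequency_distance_to_box(q, lo=[0, 0, 0, 0], hi=[4, 4, 4, 4]))  # 0
```

If this value already exceeds the query radius, no window summarized by the box can match, so the whole box can be skipped.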

