Reference-based Indexing of Sequence Databases Jayendra Venkateswaran, Deepak Lachwani, Tamer Kahveci, Christopher Jermaine University of Florida-Gainesville.

Reference-based Indexing of Sequence Databases Jayendra Venkateswaran, Deepak Lachwani, Tamer Kahveci, Christopher Jermaine University of Florida-Gainesville www.cise.ufl.edu/~jgvenkat VLDB 2006

2 Similarity Search
Given a threshold, find the sequences in the sequence database S that are similar to the query sequence.
[Figure: a query is run against sequence database S and returns the similar sequences s_i, s_j, s_k.]

3 Measure: Edit Distance
Edit distance is the minimum number of edit operations needed to transform one sequence into another.
Edit operations: insert, delete, and replace.
Example (sequence length: 12):
P: ACGTACGTAC_GT
   | |||| ||| ||
Q: A_GTACCTACCGT
3 edit operations: 2 insertions (the gaps marked _) and 1 replace.
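
The definition on this slide can be checked with the standard dynamic-programming recurrence. The sketch below is a minimal illustration, not the authors' implementation; the function name `edit_distance` is an assumption made for this example. It keeps only two rows of the table at a time.

```python
def edit_distance(p: str, q: str) -> int:
    """Minimum number of inserts, deletes, and replaces turning p into q."""
    # prev[j] is the distance between the prefix of p processed so far and q[:j].
    prev = list(range(len(q) + 1))
    for i, pc in enumerate(p, start=1):
        curr = [i]  # turning p[:i] into the empty string takes i deletes
        for j, qc in enumerate(q, start=1):
            curr.append(min(
                prev[j] + 1,               # delete pc
                curr[j - 1] + 1,           # insert qc
                prev[j - 1] + (pc != qc),  # replace, or a free match
            ))
        prev = curr
    return prev[-1]


# The two sequences from the slide, with the gap characters removed.
print(edit_distance("ACGTACGTACGT", "AGTACCTACCGT"))  # 3
```

The result agrees with the three edit operations shown in the alignment.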

4 Edit Distance: Complexity
Time and space complexity for computing the edit distance between two sequences is O(n^2).
For a sequence database S with |S| = 100,000 and 0.25 seconds per sequence comparison, a single query takes about 7 hours (100,000 × 0.25 s = 25,000 s).

5 Need for Indexing
Select K sequences from the sequence database S as references and pre-compute the reference-to-sequence distances.
At query time, the query is compared with the K references and then only with the candidate set C: (K + |C|) << |S|.

6 Existing Methods
Hierarchical methods: VP-Tree (Yianilos, 1993), MVP-Tree (Bozkaya et al., 1997), M-Tree (Ciaccia et al., 1997), Slim-Tree (Traina et al., 2000), DF-Tree (Traina et al., 2002), DBM-Tree (Vieira et al., 2004).
Omni (Filho et al., 2001).
Frequency Vector (Kahveci et al., 2004).

7 Reference-based Indexing
[Figure: database sequences, a reference, and the query; the reference circle includes the query.]

8 Reference-based Indexing
[Figure: the reference circle includes the query.]
Sequences outside the reference circle (far from the reference) are pruned.
Sequences close to the reference can also be pruned.

9 Reference-based Indexing
[Figure: database sequences, a reference, and the query; the reference circle excludes the query.]

10 Reference-based Indexing
[Figure: the reference circle excludes the query.]
Sequences inside the reference circle (close to the reference) are pruned.

11 Reference-based Indexing: Bounds
Given a sequence s, a reference r, and a query q, let d1 and d2 be the two distances to the reference (query to reference and sequence to reference).
Lower bound: the minimum distance between q and s with r as reference, |d1 - d2|.
Upper bound: the maximum distance between q and s with r as reference, d1 + d2.
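
Both bounds follow from the triangle inequality, which edit distance satisfies. A minimal sketch, assuming the two distances to the reference are already known; the function name `bounds` and the example numbers are illustrative, not from the slides.

```python
def bounds(d1: int, d2: int) -> tuple[int, int]:
    """Triangle-inequality bounds on the query-to-sequence distance.

    d1 and d2 are the query-to-reference and sequence-to-reference
    distances; the bounds are symmetric, so their order does not matter.
    """
    lower = abs(d1 - d2)
    upper = d1 + d2
    return lower, upper


# Hypothetical distances: the query is 5 edits from r, the sequence 2.
lower, upper = bounds(5, 2)
print(lower, upper)  # 3 7
```

With a query range of 2, for example, this sequence could be discarded without computing its edit distance to the query, because the lower bound of 3 already exceeds the range.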

12 Observations
Two types of pruning: sequences close to references, and sequences far from references.
A good reference set should be able to use both kinds of pruning effectively.
Each reference should prune some part of the database not pruned by the other references.

13 Outline
Selection of References; Assignment of References; Search Algorithm; Experimental Results; Conclusions

14 Our Contributions
Selection of references:
1. Maximum Variance selection: choose references whose distances to the other sequences in the database have high variance.
2. Maximum Pruning: a combinatorial approach to selecting the best reference set.
Assignment of references: each sequence has a different set of references.

15 Selection of References: Maximum Variance (MV)
Basic idea: select references that have many sequences both close to and far from them, and that can therefore prune those sequences.
[Figure: database sequences with a good reference and a bad reference.]

16 Selection of References: Maximum Variance (MV)
Select references that have sequences both close to and far away from them.
Such references have maximum variance in the distribution of their distances to the other sequences in the database.
A new reference prunes some part of the database not pruned by the existing set of references.

17 Maximum Variance: Algorithm
[Figure: sequence database S1 to S8. Panel labels: Random Subset of Sequences; Compute Distances; Variance of Distance Distributions; Sort; Candidate Reference Set; Remove Sequences Close to or Far away from New Reference.]
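
The labels on this slide suggest a pipeline that can be sketched roughly as follows. This is one possible reading, not the authors' code: the function name, the parameters `sample_size`, `near`, and `far`, and the rule that "close" and "far" are fractions of the largest distance to the new reference are all assumptions made for illustration.

```python
import random
from statistics import pvariance


def max_variance_references(db, k, dist, sample_size=50, near=0.1, far=0.9, seed=0):
    """Pick up to k references in decreasing order of distance variance."""
    rng = random.Random(seed)
    # Estimate each distance distribution against a random subset of the database.
    sample = rng.sample(db, min(sample_size, len(db)))

    def variance(s):
        return pvariance([dist(s, t) for t in sample])

    # Candidates sorted by variance, highest first.
    remaining = sorted(db, key=variance, reverse=True)
    refs = []
    while remaining and len(refs) < k:
        ref, rest = remaining[0], remaining[1:]
        refs.append(ref)
        if not rest:
            break
        dists = [dist(ref, s) for s in rest]
        span = max(dists)
        # Assumed removal rule: drop candidates that are close to or far from
        # the new reference, since it can already prune them.
        remaining = [s for s, d in zip(rest, dists) if near * span < d < far * span]
    return refs


# Toy usage with integers and absolute difference standing in for edit distance.
db = list(range(100))
print(max_variance_references(db, 3, lambda a, b: abs(a - b)))
```

Removing candidates near to or far from each chosen reference is what pushes later references toward parts of the database that earlier ones cannot prune.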

18 Maximum Variance: Example
[Figure: database sequences and reference sequences labeled a to g. Two orderings appear in the figure, f g e a d c b and e g c a f b d, with the label Maximum Variance Ordering.]

19 Selection of References: Maximum Pruning (MP)
A combinatorial approach that selects the best reference set for a given query set.
Select the reference set that prunes the most sequences over all queries.
A sample query set Q’ that follows the actual query distribution is given.
Sampling techniques reduce the complexity of this method.

20-28 Maximum Pruning: Algorithm
[Figure, animated over nine slides. Panels: Sequence Database (S1 to S6), Candidate References (S1 to S6), Sample Queries Q’ (Q1 to Q4), Reference Set (v1 v2 v3 v4), and GAINS (G1 to G6).]
Candidate S1 is tried in each position of the reference set in turn (S1 v2 v3 v4; v1 S1 v3 v4; v1 v2 S1 v4; v1 v2 v3 S1) and its gain G1 is recorded.
The same is done for the remaining candidates, giving gains G2 to G6.
The candidate with the MAX() gain is selected, and the process repeats while MAX() > 0.
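
The animation can be read as a greedy swap search over the reference set. The sketch below follows that reading under stated assumptions: gain is taken to be the increase in the number of (query, sequence) pairs whose lower bound exceeds the query range `eps`, the initial reference set is simply the first k candidates, and all function and parameter names are invented for this example. It is not the authors' implementation and omits their sampling optimizations.

```python
def pruned_count(refs, db, queries, eps, dist):
    """Number of (query, sequence) pairs pruned by at least one reference."""
    count = 0
    for q in queries:
        dq = [dist(q, r) for r in refs]
        for s in db:
            # Pruned when some reference gives a lower bound above eps.
            if any(abs(d - dist(s, r)) > eps for d, r in zip(dq, refs)):
                count += 1
    return count


def maximum_pruning(db, candidates, queries, k, eps, dist):
    """Greedy swaps: apply the best-gain replacement while the gain is positive."""
    refs = list(candidates[:k])  # assumed starting point
    current = pruned_count(refs, db, queries, eps, dist)
    while True:
        best_gain, best_refs = 0, None
        for c in candidates:
            if c in refs:
                continue
            for i in range(len(refs)):
                trial = refs[:i] + [c] + refs[i + 1:]
                gain = pruned_count(trial, db, queries, eps, dist) - current
                if gain > best_gain:
                    best_gain, best_refs = gain, trial
        if best_refs is None:  # MAX() gain is not positive: stop
            return refs
        refs, current = best_refs, current + best_gain


# Toy usage with integers and absolute difference standing in for edit distance.
db = list(range(0, 60, 3))
d = lambda a, b: abs(a - b)
refs = maximum_pruning(db, db, queries=[5, 31, 52], k=2, eps=4, dist=d)
print(refs, pruned_count(refs, db, [5, 31, 52], 4, d))
```

Because every accepted swap strictly increases the pruned count, and that count is bounded by the number of (query, sequence) pairs, the loop always terminates.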

29 Maximum Pruning Example
[Figure: query q among database sequences a to f, with reference sequences marked and the sequences pruned by a highlighted. Labels: Reference Set; Candidate Reference and Gain table (a: 2); sets {a,b,c,d,e,f}, {a,b,c,d}, {a,d,e,f}.]

30 Maximum Pruning Example
[Figure: the same layout, with the Candidate Reference and Gain table now listing a: 2, c: 1, f: 1.]

31 Outline
Selection of References; Assignment of References; Search Algorithm; Experimental Results; Conclusions

32 Assignment of References
Same references for all sequences: select K sequences as references and pre-compute the reference-to-sequence distances. A query costs (K + |C|) << |S| comparisons, where C is the candidate set.
Different references per sequence: increase the number of references to m and assign K of them to each sequence. A query then costs (m + |C’|) < (K + |C|) << |S| comparisons, where C’ is the new candidate set.

33 Reference Assignment: Example
[Figure: sample queries q1, q2, q3 among database sequences a to f, with a panel labeled References for b. Number of references = 2.]

35 Search Algorithm
Inputs: reference set V, sequence database S, query q.
1. Pre-compute the sequence-to-reference distances.
2. Compute the query-to-reference distances.
3. For each sequence s, compute the lower and upper bounds from its references and take MAX(LB) and MIN(UB).
4. If ε < MAX(LB), add s to the pruned set. If ε > MIN(UB), add s to the result set. If MAX(LB) ≤ ε ≤ MIN(UB), add s to the candidate set.
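
The three-way test on this slide can be sketched as a range search. This is a minimal illustration under assumptions: every sequence uses the same reference list, `ref_dists[i]` holds the pre-computed distances from `db[i]` to each reference, candidates are verified with a direct distance computation afterwards, and all names are invented for the example.

```python
def range_search(db, refs, ref_dists, q, eps, dist):
    """Return (results, number of candidates, number pruned) for query q."""
    dq = [dist(q, r) for r in refs]  # query-to-reference distances
    results, candidates, pruned = [], [], 0
    for s, ds in zip(db, ref_dists):
        max_lb = max(abs(a - b) for a, b in zip(dq, ds))
        min_ub = min(a + b for a, b in zip(dq, ds))
        if eps < max_lb:
            pruned += 1            # cannot be within eps of q
        elif eps > min_ub:
            results.append(s)      # must be within eps of q
        else:
            candidates.append(s)   # undecided: needs a real comparison
    # Only the candidate set pays for an actual distance computation.
    results.extend(s for s in candidates if dist(q, s) <= eps)
    return results, len(candidates), pruned


# Toy usage with integers and absolute difference standing in for edit distance.
d = lambda a, b: abs(a - b)
db = [1, 5, 9, 20, 40]
refs = [0, 30]
ref_dists = [[d(s, r) for r in refs] for s in db]
print(range_search(db, refs, ref_dists, q=6, eps=3, dist=d))  # ([5, 9], 2, 3)
```

In this toy run three of the five sequences are pruned from the bounds alone, so only two comparisons against database sequences are needed, plus two against the references.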

37 Experimental Setup
Datasets:
DNA: alphabet size 4; 20,000 sequences.
Protein: alphabet size 20; 4,000 sequences of up to 500 amino acids.
Text: alphabet size 36; 8,000 sequences of length 100 each.
Size of reference set: m = 200.
Experiments:
Comparison among our methods: Maximum Variance with same and different reference sets (MV-S and MV-D); Maximum Pruning with same and different reference sets (MP-S and MP-D).
Comparison with other methods: Frequency Vector (Kahveci et al., 2004); Omni (Filho et al., 2001); others: M-Tree (Ciaccia et al., 1997), Slim-Tree (Traina et al., 2000), DBM-Tree (Vieira et al., 2004), and DF-Tree (Traina et al., 2002).

38 Comparison of Our Methods DNA Dataset k = 4

39 Comparison of Our Methods DNA Dataset Range = 8

40 Comparison with Other Methods DNA Dataset, k = 16

41 Conclusion
References selected by Maximum Variance and Maximum Pruning eliminate more database sequences than existing selection strategies.
Assigning a different reference set to each sequence dramatically improves performance.
MP-D outperforms existing methods in almost all the experiments.

Questions? Thank you. jgvenkat@cise.ufl.edu

43 Comparison with Other Methods: Protein Dataset Query Range = 300

44 Assignment of References: Memory Limitations
Main memory stores the pre-computed reference-to-sequence distances along with the references.
For each [s, v_i] pair (s ∈ S, v_i ∈ V), store [i, ED(s, v_i)], which takes 8 bytes.
Given the available main memory B in bytes: B = 8KN + zm, where
N: number of sequences in the database.
K: number of references per sequence.
z: size of each sequence in bytes.
m: number of references in the reference set.
Example: given B = 1 GB, N = 10 million, z = 100, and m = 1000, then K = 13.
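
The example follows from solving B = 8KN + zm for K and rounding down. A small sketch of that arithmetic, assuming 1 GB means 2^30 bytes (with 10^9 bytes the same formula gives 12); the function name is illustrative.

```python
def references_per_sequence(B: int, N: int, z: int, m: int) -> int:
    """Largest K with 8*K*N + z*m <= B, from B = 8KN + zm."""
    return (B - z * m) // (8 * N)


# Slide example: B = 1 GB (assumed 2**30 bytes), N = 10 million, z = 100, m = 1000.
print(references_per_sequence(2**30, 10_000_000, 100, 1000))  # 13
```

The zm term (100,000 bytes here) is negligible next to the 8KN term, so K is essentially determined by the memory budget per database sequence.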
