05.04.2008 SSAHA: A Fast Search Method For Large DNA Databases Zemin Ning, Anthony J. Cox and James C. Mullikin Seminar by: Gerry Kammerer © ETH Zürich.


1 05.04.2008 SSAHA: A Fast Search Method For Large DNA Databases Zemin Ning, Anthony J. Cox and James C. Mullikin Seminar by: Gerry Kammerer © ETH Zürich | Gerry Kammerer

2 05.04.2008 Gerry Kammerer – ETH Zürich 2 Human Genome

3 Outline
- Introduction
- DNA and DNA sequences
- The problem and some approaches
- The SSAHA-approach
- Conclusions

4 Outline
- Introduction
- DNA and DNA sequences
- The problem and some approaches
- The SSAHA-approach
- Conclusions

5 DNA
- Deoxyribonucleic acid
- Contains genetic instructions
- Double helix
- Long polymer of simple units (nucleotides)
- Backbone made of sugars and phosphate
- One of four types of molecules (bases) attached to each sugar
- Sequence of these four bases encodes information

6 DNA sequence
- Base pair
  - Bases from each strand form bonds
- DNA sequence
  - Succession of letters
  - Adenine, Cytosine, Guanine, Thymine
  - Measured in giga bases (Gb) or giga base pairs (Gbp)

7 The Problem
- Sequence comparison (exact / approximate)
- Through comparison, draw conclusions about
  - Structure
  - Function
  - Cooperation of components
- Sequencing
  - Produces multiple megabytes of data per day
- Large volumes of queries and data overtax existing techniques
  - Results not found in reasonable time / not exact enough

8 Approaches
- Dynamic programming (first approaches)
  - Needleman & Wunsch, 1970
  - Refinement: Smith & Waterman, 1981 (most popular)
- BLAST (Basic Local Alignment Search Tool)
  - Altschul et al., 1990
  - Faster / less accurate
  - Family of programs
- Suffix tree algorithms
  - Need too much memory

9 Outline
- Introduction
- DNA and DNA sequences
- The problem and some approaches
- The SSAHA-approach
- Conclusions

10 SSAHA-approach
- Uses hash table structures
- Needs much memory (nowadays we have more RAM!)
  - But significantly less than suffix tree methods!
- 3 to 4 orders of magnitude faster than BLAST

11 Definitions
- Query Q = "GGATCCCCTG"
- DB = S1, S2, S3, S4, ... (DNA sequences)
- k-tuple: k consecutive bases, e.g. the 4-tuple "GGAT"
- A sequence S of length n has (n - k + 1) overlapping k-tuples
- (i, j) references a k-tuple
  - i is the index of the sequence
  - j is the offset in the sequence
- Example DB: S1 = "GGATCCCCTG", S2 = "TGCAACAT", S3 = "AACATCCTGGG"
  - The 2-tuple (2,3) is "AA": sequence S2, offset 3 (offsets count from 0)
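The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper; the helper names `k_tuples` and `tuple_at` are made up for the example, and offsets are assumed to count from 0.

```python
def k_tuples(seq, k):
    """Return (offset, k-tuple) pairs for all overlapping k-tuples of seq."""
    return [(j, seq[j:j + k]) for j in range(len(seq) - k + 1)]


def tuple_at(db, i, j, k):
    """Resolve a reference (i, j): sequence index i (1-based), offset j (0-based)."""
    return db[i - 1][j:j + k]


# Example DB from the slide: S1, S2, S3
db = ["GGATCCCCTG", "TGCAACAT", "AACATCCTGGG"]

tuples_s1 = k_tuples(db[0], 4)       # n - k + 1 = 10 - 4 + 1 = 7 entries
referenced = tuple_at(db, 2, 3, 2)   # the 2-tuple referenced by (2, 3)
```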

12 Hash table construction
- k-tuples
  - Only 4^k possible k-tuples (as we have four bases)
- List of positions L
  - Positions of k-tuples (sorted by k-tuple)
- Array A
  - Pointers into L
  - (Which positions in L belong to which k-tuples)
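One way to index array A, and the assumption made in this sketch, is to read a k-tuple as a base-4 number with A=0, C=1, G=2, T=3, so that numeric order equals sorted k-tuple order. The names `encode` and `lookup` and the choice of the first base as the most significant digit are illustrative, not taken from the slides.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code per base


def encode(ktuple):
    """Map a k-tuple to an integer in [0, 4**k); first base is most significant."""
    value = 0
    for base in ktuple:
        value = value * 4 + BASE_CODE[base]
    return value


def lookup(ktuple, A, L):
    """Return the slice of L that holds the positions of ktuple.

    A[e] is the start of the k-tuple's block in L; the block ends where
    the next k-tuple's block starts (or at the end of L for the last one).
    """
    e = encode(ktuple)
    end = A[e + 1] if e + 1 < len(A) else len(L)
    return L[A[e]:end]
```

With this layout A needs exactly 4^k entries, one per possible k-tuple, regardless of how many of them actually occur in the database.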

13 Hash table construction (ctd.)
Example DB (1-tuples): S1 = "GGATCC", S2 = "TGCAAC", S3 = "AATA"
Array A: A = [0,6,10,13] (A = 0, C = 6, G = 10, T = 13)
List of positions L:
 0:(1,2)  1:(2,3)  2:(2,4)  3:(3,0)  4:(3,1)  5:(3,3)   (A)
 6:(1,4)  7:(1,5)  8:(2,2)  9:(2,5)                     (C)
10:(1,0) 11:(1,1) 12:(2,1)                              (G)
13:(1,3) 14:(2,0) 15:(3,2)                              (T)
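The construction of L and A for a small database can be sketched as follows. This is a minimal in-memory version under stated assumptions: sequence indices are 1-based, offsets are 0-based, k-tuples are ordered alphabetically (A, C, G, T), and the function name `build_hash_table` is hypothetical.

```python
from itertools import product

BASES = "ACGT"


def build_hash_table(db, k):
    """Build array A (block starts) and list L (positions) for all k-tuples."""
    # All 4**k possible k-tuples in sorted order.
    order = ["".join(p) for p in product(BASES, repeat=k)]
    buckets = {t: [] for t in order}
    for i, seq in enumerate(db, start=1):
        for j in range(len(seq) - k + 1):
            t = seq[j:j + k]
            if t in buckets:  # skip tuples with non-ACGT symbols
                buckets[t].append((i, j))
    A, L = [], []
    for t in order:
        A.append(len(L))      # where this k-tuple's block starts in L
        L.extend(buckets[t])  # positions, by sequence index then offset
    return A, L


A, L = build_hash_table(["GGATCC", "TGCAAC", "AATA"], k=1)
```

Building per-tuple buckets first keeps the sketch short; a memory-conscious implementation would more likely count occurrences in one pass and fill L in a second.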

14 Sequence Search
- Query Q = "GAAT..." (a DNA sequence)
- Process the k-tuples of the query, advancing base by base
  - E.g. with 2-tuples: "GA", "AA", "AT", ...
- Construct hits: (i, k, j)
  - (i, j) is a position of the current k-tuple (from the hash table)
  - k = j - (offset of the current k-tuple in Q); in a hit, k denotes this shift, not the tuple length
  - n entries in the hash table for the current k-tuple = n hits

15 Sequence Search (ctd.)
- Sorting the hits
  - (i,k,j): first by i, then k, then j
- Let us have a look at a small example!
Query Q = "AT"

16 Remember
Example DB (1-tuples): S1 = "GGATCC", S2 = "TGCAAC", S3 = "AATA"
Array A: A = [0,6,10,13] (A = 0, C = 6, G = 10, T = 13)
List of positions L:
 0:(1,2)  1:(2,3)  2:(2,4)  3:(3,0)  4:(3,1)  5:(3,3)   (A)
 6:(1,4)  7:(1,5)  8:(2,2)  9:(2,5)                     (C)
10:(1,0) 11:(1,1) 12:(2,1)                              (G)
13:(1,3) 14:(2,0) 15:(3,2)                              (T)
Query Q = "AT"

17 Sequence Search Example
Example DB (1-tuples), list of positions L as on slide 16
Query Q = "AT", processed base by base
Hits for "A" (query offset 0):
(1,2,2) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,3,3)

18 Sequence Search Example (ctd.)
Example DB (1-tuples), list of positions L as on slide 16
Query Q = "AT", processed base by base
Hits for "A" (query offset 0):
(1,2,2) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,3,3)
Hits for "T" (query offset 1):
(1,2,3) (2,-1,0) (3,1,2)

19 Sequence Search Example (ctd.)
Example DB (1-tuples), list of positions L as on slide 16
Query Q = "AT", processed base by base
Sorted hits:
(1,2,2) (1,2,3) (2,-1,0) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,1,2) (3,3,3)

20 Sequence Search Example (ctd.)
Example DB (1-tuples): S1 = "GGATCC", S2 = "TGCAAC", S3 = "AATA"
Query Q = "AT", processed base by base
Sorted hits:
(1,2,2) (1,2,3) (2,-1,0) (2,3,3) (2,4,4) (3,0,0) (3,1,1) (3,1,2) (3,3,3)
Same i, k in consecutive hits: run of matching bases
- (1,2,2) (1,2,3): "AT" found in S1 at offset 2
- (3,1,1) (3,1,2): "AT" found in S3 at offset 1
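The search walk-through on these slides can be reproduced with a short sketch. It assumes a plain dictionary in place of the A/L arrays, 1-based sequence indices and 0-based offsets; the function names `build_index`, `sorted_hits` and `runs` are illustrative only.

```python
from collections import defaultdict
from itertools import groupby


def build_index(db, k):
    """Map each k-tuple to its list of (sequence index, offset) positions."""
    index = defaultdict(list)
    for i, seq in enumerate(db, start=1):
        for j in range(len(seq) - k + 1):
            index[seq[j:j + k]].append((i, j))
    return index


def sorted_hits(query, index, k):
    """Collect hits (i, shift, j) for every k-tuple of the query, then sort."""
    hits = []
    for t in range(len(query) - k + 1):          # t = offset in the query
        for i, j in index.get(query[t:t + k], []):
            hits.append((i, j - t, j))           # shift = j - t
    hits.sort()                                  # by i, then shift, then j
    return hits


def runs(hits, min_len=2):
    """Group sorted hits that share sequence index and shift."""
    found = []
    for (i, shift), group in groupby(hits, key=lambda h: h[:2]):
        offsets = [h[2] for h in group]
        if len(offsets) >= min_len:
            found.append((i, shift, offsets))
    return found


index = build_index(["GGATCC", "TGCAAC", "AATA"], k=1)
hits = sorted_hits("AT", index, k=1)
matches = runs(hits)
```

Sorting is what makes the grouping a single linear pass: hits belonging to the same diagonal (same sequence, same shift) end up adjacent.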

21 Sequence Search Summary
- Run of matching bases
  - Region of exact matches
  - Gapped matches
- Only finds matches in the forward direction!
  - Reverse-complement the query to find matches in the reverse direction
Example (3-tuples, 9-base query), hits:
(3,9,9) (3,9,12) (3,9,15)
(5,3,3) (5,3,9)
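A reverse-direction search needs a transformed query. The sketch below assumes that this means the reverse complement, as is usual for double-stranded DNA; the function name `reverse_complement` is illustrative.

```python
# Translation table for complementary bases: A<->T, C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


reverse_query = reverse_complement("AATA")
```

Running the forward search once with the query and once with its reverse complement covers both strands without changing the hash table.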

22 Memory Requirements
- Array A: 4 * 4^k = 4^(k+1) bytes
  - 32-bit pointers, 4^k possible k-tuples
- List L: 8 * W bytes
  - W = number of k-tuples in the database
- Reduce memory usage
  - Only consider non-overlapping k-tuples
  - Discard highly frequent k-tuples
  - Loss of accuracy!
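The two formulas above are easy to evaluate for concrete parameters. In this sketch the function name and the example value of W are assumptions: W = 225,000,000 is roughly what 2.7 Gb of sequence would give with non-overlapping 12-tuples, and is not a figure from the slides.

```python
def ssaha_memory_bytes(k, W, pointer_bytes=4, position_bytes=8):
    """Return (bytes for array A, bytes for list L) per the slide's formulas."""
    array_a = pointer_bytes * 4 ** k   # one 32-bit pointer per possible k-tuple
    list_l = position_bytes * W        # 8 bytes per stored k-tuple position
    return array_a, list_l


# Hypothetical example: k = 12, W = 225,000,000 stored k-tuples.
array_bytes, list_bytes = ssaha_memory_bytes(k=12, W=225_000_000)
# 4 * 4**12 = 67,108,864 bytes for A; 8 * W = 1,800,000,000 bytes for L
```

Array A grows by a factor of four with each increase of k, while L depends only on how many k-tuples are stored, which is why non-overlapping tuples and frequency cutoffs target W.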

23 Search speed
- Search speed depends on
  - T_hash: building the hash tables
  - T_search: processing a specific query
- T_hash does not matter much
  - Computed once for one DB (save to disk, server usage)

24 Optimise Search speed
- Sorting algorithm
  - In practice: close to linear with quicksort
- Parameters k and W (tradeoff with accuracy)
  - Increase k (loss of sensitivity)
  - Reduce W by cutting off very frequently occurring k-tuples
  - Strong effect! (There exist highly repetitive k-tuples)
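Reducing W by a frequency cutoff can be sketched as a filter over the index. This is a simplified illustration that uses a dictionary index and an absolute occurrence limit; the names `build_index`, `discard_frequent` and `max_occurrences` are hypothetical, and the slides do not say how the cutoff is chosen.

```python
from collections import defaultdict


def build_index(db, k):
    """Map each k-tuple to its list of (sequence index, offset) positions."""
    index = defaultdict(list)
    for i, seq in enumerate(db, start=1):
        for j in range(len(seq) - k + 1):
            index[seq[j:j + k]].append((i, j))
    return index


def discard_frequent(index, max_occurrences):
    """Drop every k-tuple that occurs more than max_occurrences times."""
    return {t: pos for t, pos in index.items() if len(pos) <= max_occurrences}


index = build_index(["GGATCC", "TGCAAC", "AATA"], k=1)
pruned = discard_frequent(index, max_occurrences=3)
remaining_w = sum(len(pos) for pos in pruned.values())
```

Every discarded k-tuple removes its whole block of hits from each query that contains it, which is where the speed-up comes from, at the cost of missing matches that rely on those tuples.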

25 Experimental results (from paper)
- 2.7 Gb of human genome DNA
  - 292,016 sequences
- 177 query sequences
  - Containing 104,755 bases
- Compaq EV6 500 MHz processor, 16 GB RAM

26 Experimental results (ctd.)
          90%                  95%                  100%
 k    T_hash    T_search   T_hash    T_search   T_hash    T_search
10    824.0s    102.5s     842.4s    128.8s     868.5s    389.5s
11    798.3s     26.3s     810.5s     36.1s     808.8s    199.1s
12    952.2s      7.3s     969.9s     11.0s     961.2s    119.0s
13    850.8s      2.2s     859.1s      4.5s     851.4s     78.7s
14    914.1s      0.9s     932.0s      2.5s     927.1s     51.6s
15    996.0s      0.1s    1015.5s      1.7s     999.2s     35.4s

27 Outline
- Introduction
- DNA and DNA sequences
- The problem and some approaches
- The SSAHA-approach
- Conclusions

28 Reasons for speed
- Hashing the database
  - Search time nearly independent of database size
  - BLAST, e.g., hashes the query and scans the DB
- Human genome is far from random
  - Discarding highly repetitive k-tuples has a big effect

29 Conclusions
- Computers improved quickly
  - Cheaper, more powerful
  - More RAM available
- Hash the database

30 Questions?

