A Table-Driven, Full-Sensitivity Similarity Search Algorithm, by Gene Myers and Richard Durbin. Presented by Wang, Jia-Nan and Huang, Yu-Feng.

1 A Table-Driven, Full-Sensitivity Similarity Search Algorithm Gene Myers and Richard Durbin Presented by Wang, Jia-Nan and Huang, Yu-Feng

2 Outline Introduction Background Preliminary Method Experiment

3 Introduction Given a query and a database, perform local alignment. Smith-Waterman: guaranteed to find all local alignments, but expensive. Heuristic alternatives: BLAST, FASTA

4 Improvement Hardware: invest more in computers and CPUs. Software: Phil Green’s SWAT appeals to sparsity and some machine-level coding tricks. About 60% of the dynamic programming matrix has value 0; avoid computing most of these unproductive entries

5 Focus on improving protein similarity searches. This approach examines and computes only 4% of the underlying dynamic programming matrix

6 Recall Sequence alignment  Local sequence alignment  Global sequence alignment Goal – matching path with highest score Table-based computation and dynamic programming

7 Dynamic Programming Three basic components  Recurrence relation  Tabular computation  Traceback

8 Smith-Waterman Method Dynamic programming algorithm. Finds the most similar subsequences of two sequences. Problem  Lots of computation  the amount of work becomes enormous ("googol")  Programmers  will go crazy, and get excited  Why?  How can it be accelerated?

9 Background Scoring System  Simple scoring scheme  Affine gap penalty scoring scheme  PAM120 (PAMn)  BLOSUM62 (BLOSUMn)

10 Simple Scoring Scheme Match (e.g. +8) Mismatch (e.g. -5) Gap constant penalty (e.g. -20)
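
As a rough illustration of the Smith-Waterman tabular computation under the simple scoring scheme, here is a minimal Python sketch. It uses the example values from this slide (match +8, mismatch -5, constant gap penalty -20). The function names `smith_waterman` and `zero_fraction` are made up for this example, and the code computes the full matrix with no trimming, so it is the baseline that the table-driven method tries to improve on, not the method itself.

```python
def smith_waterman(query, target, match=8, mismatch=-5, gap=-20):
    """Return (best_score, H) for local alignment with a constant per-symbol gap penalty."""
    rows, cols = len(query) + 1, len(target) + 1
    # H[i][j] = best score of a local alignment ending at query[i-1] and target[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a fresh alignment at this cell
                H[i - 1][j - 1] + sub,  # match or mismatch
                H[i - 1][j] + gap,      # query symbol aligned to a gap
                H[i][j - 1] + gap,      # target symbol aligned to a gap
            )
            best = max(best, H[i][j])
    return best, H


def zero_fraction(H):
    """Fraction of interior cells (boundary row and column excluded) whose value is 0."""
    cells = [v for row in H[1:] for v in row[1:]]
    return sum(1 for v in cells if v == 0) / len(cells)


best, H = smith_waterman("ACGT", "TACGA")
print(best, zero_fraction(H))
```

Calling `zero_fraction` on the matrix gives a feel for how many entries are unproductive zeros on a given pair of sequences.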

11 Affine Gap Penalty Scoring Scheme Match (e.g. +8) Mismatch (e.g. -5) Gap symbol (e.g. -5) Gap open penalty (e.g. -10)

12 PAM PAM – Percent Accepted Mutation  Dayhoff et al. (1978) PAM unit  Evolutionary time corresponding to average of 1 mutation per 100 residues  1% accepted PAMn  Relates to mutation probabilities in evolutionary interval of n PAM units Some information from: http://www.apl.jhu.edu/~przytyck/CAMS_2004_1b.pdf
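
To make "evolutionary interval of n PAM units" concrete: the mutation probability matrix for n PAM units is the 1-PAM matrix multiplied by itself n times. The sketch below shows this on a toy two-letter alphabet. The matrix `toy_pam1` is invented for illustration and is not Dayhoff's data; the helper names are hypothetical as well.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Mutation probabilities after n >= 1 PAM units: pam1 raised to the n-th power."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Toy 1-PAM matrix (an assumption for illustration): each letter mutates
# to the other with probability 1%, i.e. 1 accepted mutation per 100 residues.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam120 = pam_n(toy_pam1, 120)
print(toy_pam120)
```

The scoring matrix used in alignment (such as PAM120) is then derived from these probabilities as log-odds scores; that step is not shown here.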

13 PAM120 Source: http://eta.embl-heidelberg.de:8000/misc/mat/pam120.html

14 BLOSUM62 BLOSUM – BLOcks SUbstitution Matrix  Steven and Jorja G. Henikoff (1992)  Paper: Amino acid substitution matrices from protein blocks [PubMed] BLOSUMn  Relates to mutation probabilities observed between pairs of related proteins, where sequences sharing more than n% identity are clustered together Some information from: http://www.apl.jhu.edu/~przytyck/CAMS_2004_1b.pdf

15 BLOSUM62 [Table: BLOSUM62 substitution matrix, rows and columns in the order C S T P A G N D E Q H R K M I L V F Y W; the entries did not survive extraction]

16 Preliminaries Σ: the alphabet over which sequences are composed. |Σ| × |Σ| substitution matrix S giving the score of aligning each pair of letters. Uniform gap penalty g > 0. Query = q_1 q_2 ... q_P of P letters. Target = t_1 t_2 ... t_N of N letters. Threshold T > 0

17 Score Table  Edit Graph Picture source: http://searchlauncher.bcm.tmc.edu/help/Pictures/S-Wexample.gif

18

19 Problem Find a high-scoring local alignment between Query and Target whose path score is ≥ T (edit-graph figure 1). Limit our attention to prefix-positive paths: if there is a path of score T or greater in the edit graph, then there is a prefix-positive path of score T or greater
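
The claim on this slide can be checked with a small Python sketch. It assumes that "prefix-positive" means every non-empty prefix of the path has a strictly positive score, and it represents a path simply as a list of edge scores. Both function names are hypothetical. Dropping everything up to the last point where the running score reaches its minimum leaves a prefix-positive path whose score is at least as large as the original.

```python
def prefix_positive_suffix(edge_scores):
    """Drop the leading edges up to the last minimum of the running score."""
    running = 0
    lowest = 0   # lowest prefix score so far (the empty prefix scores 0)
    cut = 0      # number of leading edges to drop
    for k, s in enumerate(edge_scores, start=1):
        running += s
        if running <= lowest:   # "<=" keeps the LAST minimum, so later sums are strictly larger
            lowest = running
            cut = k
    return edge_scores[cut:]


def is_prefix_positive(edge_scores):
    """True if every non-empty prefix has a strictly positive total score."""
    running = 0
    for s in edge_scores:
        running += s
        if running <= 0:
            return False
    return True


path = [8, -5, -5, 8, 8, -5, 8]          # total score 17, but the third prefix scores -2
trimmed = prefix_positive_suffix(path)   # drops the first three edges
print(trimmed, sum(trimmed), is_prefix_positive(trimmed))
```

Because the dropped prefix never has a positive score, the trimmed path still meets any threshold T that the original path met.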

20 Definition A set P of index-value pairs { (i,v): i is [0,P]

21 The start and extension tables Consider a vertex x in row j of the edit graph of Query vs. Target

22

23 Start Trimming Limiting the dynamic programming to the startable vertices requires a table Start(w), where w ∈ Σ^(k_s)

24 Start Trimming Worst case Let α be the expected percentage of vertices that are seeds

25 Extension Trimming A table that eliminates vertices that are not extendable. (i,j) is an extendable vertex iff C(i,j) > Extend(i, Target[j+1…j+k_e])

26 Extension Trimming

27

28 A Table-Driven Scheme for DP Goal: to restrict the SW computation to productive vertices Jump table – captures the effect of Advance and Delete over k_J > 0 rows  space  unmanageably large  But only record those for which

29 Jump table Start table Space-saving version for Jump and Start tables

30 Check for paths scoring T or more 

31

32 Recall – Affine Gap Penalty Score  Match  Mismatch  Gap symbol - gsp  Gap open penalty - gop Affine cost of gap of length k  g + kh, g = gop, h = gsp
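
A small Python sketch of the affine cost formula g + kh from this slide. The function name `affine_gap_cost` is hypothetical, and treating a zero-length gap as costing nothing is an assumption of this example rather than something the slide states.

```python
def affine_gap_cost(k, gop, gsp):
    """Cost of a gap of length k: the open penalty gop (g) plus k times the gap symbol penalty gsp (h)."""
    if k < 0:
        raise ValueError("gap length must be non-negative")
    if k == 0:
        return 0  # assumption: no gap, no cost
    return gop + k * gsp


# The experiment slide uses gap costs written as 8+4n and 8+2n.
print(affine_gap_cost(3, gop=8, gsp=4))   # 8 + 3*4
print(affine_gap_cost(3, gop=8, gsp=2))   # 8 + 3*2
```

Compared with a uniform per-symbol penalty, the affine form charges the opening cost only once per gap, so one long gap is cheaper than several short ones of the same total length.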

33 Diagram of Affine Gap Penalty [Figure: transitions among the C, I, and D states, with edge weights -h, -g-h, and δ(a_i, b_j)] Source: kmchao’s lecture note

34 Recurrence system - Gotoh
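
The equations on this slide did not survive in the transcript. For reference, a commonly used form of Gotoh's recurrence is given below, written in the notation of the neighbouring slides (states C, D, I; gap open penalty g; gap symbol penalty h; substitution score δ). This is a sketch of the local-alignment variant and may differ in detail from the slide's original; dropping the 0 term gives the global variant.

```latex
\begin{aligned}
D(i,j) &= \max\{\, D(i-1,j) - h,\; C(i-1,j) - g - h \,\} \\
I(i,j) &= \max\{\, I(i,j-1) - h,\; C(i,j-1) - g - h \,\} \\
C(i,j) &= \max\{\, 0,\; D(i,j),\; I(i,j),\; C(i-1,j-1) + \delta(a_i, b_j) \,\}
\end{aligned}
```

Opening a gap costs g + h for its first symbol and h for each further symbol, which agrees with the cost g + kh for a gap of length k.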

35 The Case of Affine Gap Costs Simple scoring scheme  affine gap penalty scheme Affine edit graph and vertex structure Question: how to modify the equations defined above?

36

37 Recurrence System for Affine Gap Costs Two observations  To compute the jth row from the (j-1)st requires knowing only the vectors of and values in row j-1, and not on the values in that row  If then the value at vertex need not be recorded as any maximal path through its will have score less than the maximal path passing through the corresponding

38 Recurrence System

39 Results

40 Experiment Method  Edit graph based approach vs. SWAT Scoring matrix  PAM120 Affine gap cost  8+4n Database (target)  3 million residue subset of the PIR database Query  A periodic clock protein of length 173 (pcp)  A lactate dehydrogenase of length 319 (dehydro)  A cGMP kinase of length 670 (kinase)  A growth factor of length 1210 (g factor)

41 PAM120 & Gap Cost 8+4n

42 BLOSUM62 & Gap Cost 8+2n

43 Thanks for Your Attention Ending

