Sequence analysis of nucleic acids and proteins: part 1 Based on Chapter 3 of Post-genome Bioinformatics by Minoru Kanehisa, Oxford University Press, 2000.

1 Sequence analysis of nucleic acids and proteins: part 1. Similarity search. Based on Chapter 3 of Post-genome Bioinformatics by Minoru Kanehisa, Oxford University Press, 2000. Lecture by Terry Speed.

2 Pairwise sequence alignment by the dynamic programming algorithm. The algorithm finds the optimal path in the path matrix (a), which is equivalent to searching for the optimal solution in the search tree (b), with pruning by an optimization function.
[Figure: (a) Path Matrix for the sequences AIMS and AMOS; (b) Search Tree.]
Alignment:
    AIM-S
    A-MOS

3 Methods for computing the optimal score in the dynamic programming algorithm: (a) the gap penalty is a constant; (b) the gap penalty is a linear function of the gap length.
[Figure: (a) D_{i,j} is reached from D_{i-1,j-1} with weight w_{s(i),t(j)}, and from D_{i-1,j} and D_{i,j-1} with weight d. (b) Three values are kept per cell, D_{i,j}^{(1)}, D_{i,j}^{(2)} and D_{i,j}^{(3)}, with the weights w_{s(i),t(j)} and b.]
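The constant-gap-penalty case (a) can be sketched in Python as follows. This is only an illustration under stated assumptions: the score is taken to be a similarity that is maximised, w is any substitution score function, and d is a negative weight charged for every gap position. The name `alignment_score` and the toy scoring function are hypothetical, not taken from the slides.

```python
def alignment_score(s, t, w, d):
    """Optimal global alignment score of s and t.

    w(a, b) is a substitution score (assumed: higher is better) and
    d is the weight added for each gap position (assumed: negative).
    """
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: only gaps have been used so far.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + d
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(
                D[i - 1][j - 1] + w(s[i - 1], t[j - 1]),  # diagonal step
                D[i - 1][j] + d,                          # gap in t
                D[i][j - 1] + d,                          # gap in s
            )
    return D[n][m]


def toy_score(a, b):
    # Hypothetical scoring: +1 for identical letters, -1 otherwise.
    return 1 if a == b else -1
```

With this toy scoring and d = -1, `alignment_score("AIMS", "AMOS", toy_score, -1)` returns 1, the score of the alignment AIM-S / A-MOS (three matches, two gap positions).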

4 Concepts of global and local optimality in pairwise sequence alignment. The distinction lies in how the initial values are assigned to the path matrix.
[Figure: (a) Global vs. Global; (b) Local vs. Global; (c) Local vs. Local. The path matrices are annotated with their initial 0 values.]
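One possible reading of "how the initial values are assigned" is sketched below in Python. It only sets up the first row and first column of the path matrix and is an assumption-laden illustration, not the figure's exact scheme: the function name, its parameters and the use of a per-position gap weight d are all hypothetical.

```python
def initial_path_matrix(n, m, d, zero_first_row=False, zero_first_col=False):
    """Build an (n+1) x (m+1) path matrix with only the edges initialised.

    d is an assumed per-position gap weight. Zeros on the first row let an
    alignment start at any column without penalty; zeros on the first
    column do the same for rows. Interior cells are left as None.
    """
    D = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for j in range(1, m + 1):
        D[0][j] = 0 if zero_first_row else j * d
    for i in range(1, n + 1):
        D[i][0] = 0 if zero_first_col else i * d
    return D
```

Under this reading, cumulative gap weights on both edges correspond to the global case, and zeros on both edges to the local case. A fully local alignment in the Smith-Waterman style additionally keeps every cell at or above zero during the recurrence, which this sketch does not cover.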

5 Dynamic programming to find edit distances
- Edit operations: M (match), R (replace), I (insert), D (delete).
- Edit transcript: a string over the alphabet {M, R, I, D} that describes a transformation of one string into another.
  Example:
      R D I M D M
      M A - T H S
      A - R T - S
- Edit (Levenshtein) distance: the minimum number of edit operations necessary to transform one string into another. (Note: matches are not counted.)
  Example (cost of the transcript above, which is not necessarily the minimum):
      R   D   I   M   D   M
      1 + 1 + 1 + 0 + 1 + 0 = 4
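The bookkeeping on this slide can be checked with a short Python sketch. It assumes unit costs (matches free, every other operation costs 1) and that D removes a letter of the first string while I adds a letter of the second, which is what the example alignment suggests. The function names are illustrative.

```python
# Assumed unit costs: matches are not counted, all other operations cost 1.
COST = {"M": 0, "R": 1, "I": 1, "D": 1}


def transcript_cost(transcript):
    """Number of non-match operations in an edit transcript."""
    return sum(COST[op] for op in transcript)


def apply_transcript(transcript, seq1, seq2):
    """Return the two gapped rows of the alignment described by a transcript."""
    row1, row2 = [], []
    i = j = 0
    for op in transcript:
        if op in ("M", "R"):      # consume one letter from each string
            row1.append(seq1[i])
            row2.append(seq2[j])
            i += 1
            j += 1
        elif op == "D":           # letter of seq1 has no partner
            row1.append(seq1[i])
            row2.append("-")
            i += 1
        elif op == "I":           # letter of seq2 has no partner
            row1.append("-")
            row2.append(seq2[j])
            j += 1
        else:
            raise ValueError(f"unknown edit operation: {op!r}")
    return "".join(row1), "".join(row2)
```

For the example above, `transcript_cost("RDIMDM")` is 4 and `apply_transcript("RDIMDM", "MATHS", "ARTS")` returns `("MA-THS", "A-RT-S")`.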

6 The recurrence
- Stage: position in the edit transcript.
- State: I, D, M, or R.
- Optimal value function: D(i, j) = edit distance of Seq 1 [1...i] and Seq 2 [1...j].
- Recurrence relation:
      D(i, j) = min { 1 + D(i-1, j),  1 + D(i, j-1),  t(i, j) + D(i-1, j-1) },
  where t(i, j) = 0 if Seq 1 [i] = Seq 2 [j], and 1 otherwise.
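A minimal Python sketch of this recurrence, filled row by row. It assumes the boundary values D(i, 0) = i and D(0, j) = j, and takes t(i, j) to be 0 for identical letters and 1 otherwise, consistent with matches not being counted. The function name is illustrative.

```python
def edit_distance_table(seq1, seq2):
    """Return the full table D, where D[i][j] is the edit distance
    between seq1[:i] and seq2[:j]."""
    n, m = len(seq1), len(seq2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # i deletions
    for j in range(1, m + 1):
        D[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(
                1 + D[i - 1][j],         # delete seq1[i]
                1 + D[i][j - 1],         # insert seq2[j]
                t + D[i - 1][j - 1],     # match or replace
            )
    return D
```

For Seq 1 = MATHS and Seq 2 = ARTS, `edit_distance_table("MATHS", "ARTS")[5][4]` is 3.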

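7 The tabulation, D(i, j) (cells marked . are not yet computed)
   Seq 2 (j)         A  R  T  S
   Seq 1 (i)      0  1  2  3  4
              0   .  .  .  .  .
        M     1   .  .  .  .  .
        A     2   .  .  .  .  .
        T     3   .  .  .  .  .
        H     4   .  .  .  .  .
        S     5   .  .  .  .  .

8 The tabulation, D(i, j)
   Seq 2 (j)         A  R  T  S
   Seq 1 (i)      0  1  2  3  4
              0   0  .  .  .  .
        M     1   .  .  .  .  .
        A     2   .  .  .  .  .
        T     3   .  .  .  .  .
        H     4   .  .  .  .  .
        S     5   .  .  .  .  .

9 The tabulation, D(i, j)
   Seq 2 (j)         A  R  T  S
   Seq 1 (i)      0  1  2  3  4
              0   0  1  .  .  .
        M     1   .  .  .  .  .
        A     2   .  .  .  .  .
        T     3   .  .  .  .  .
        H     4   .  .  .  .  .
        S     5   .  .  .  .  .

10 The tabulation, D(i, j)
   Seq 2 (j)         A  R  T  S
   Seq 1 (i)      0  1  2  3  4
              0   0  1  2  .  .
        M     1   .  .  .  .  .
        A     2   .  .  .  .  .
        T     3   .  .  .  .  .
        H     4   .  .  .  .  .
        S     5   .  .  .  .  .

11 The tabulation, D(i, j)
   Seq 2 (j)         A  R  T  S
   Seq 1 (i)      0  1  2  3  4
              0   0  1  2  3  4
        M     1   1  .  .  .  .
        A     2   2  .  .  .  .
        T     3   3  .  .  .  .
        H     4   4  .  .  .  .
        S     5   5  .  .  .  .

12 The tabulation, D(i, j)
   Seq 2 (j)         A  R  T  S
   Seq 1 (i)      0  1  2  3  4
              0   0  1  2  3  4
        M     1   1  1  .  .  .
        A     2   2  .  .  .  .
        T     3   3  .  .  .  .
        H     4   4  .  .  .  .
        S     5   5  .  .  .  .

13 The tabulation, D(i, j)
   Seq 2 (j)         A  R  T  S
   Seq 1 (i)      0  1  2  3  4
              0   0  1  2  3  4
        M     1   1  1  2  .  .
        A     2   2  .  .  .  .
        T     3   3  .  .  .  .
        H     4   4  .  .  .  .
        S     5   5  .  .  .  .

14 The tabulation, D(i, j)
   Seq 2 (j)         A  R  T  S
   Seq 1 (i)      0  1  2  3  4
              0   0  1  2  3  4
        M     1   1  1  2  3  4
        A     2   2  1  2  3  4
        T     3   3  .  .  .  .
        H     4   4  .  .  .  .
        S     5   5  .  .  .  .

15 The tabulation, D(i, j)
   Seq 2 (j)         A  R  T  S
   Seq 1 (i)      0  1  2  3  4
              0   0  1  2  3  4
        M     1   1  1  2  3  4
        A     2   2  1  2  3  4
        T     3   3  2  2  2  3
        H     4   4  .  .  .  .
        S     5   5  .  .  .  .

16 The tabulation, D(i, j)
   Seq 2 (j)         A  R  T  S
   Seq 1 (i)      0  1  2  3  4
              0   0  1  2  3  4
        M     1   1  1  2  3  4
        A     2   2  1  2  3  4
        T     3   3  2  2  2  3
        H     4   4  3  3  3  3
        S     5   5  4  4  4  3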

17 The traceback (through the completed D(i, j) table of slide 16, from D(5, 4) = 3 back to D(0, 0))

18 The solutions - #1
      1 + 0 + 1 + 1 + 0 = 3
      D   M   R   R   M
      M   A   T   H   S
      -   A   R   T   S

19 The traceback (through the completed D(i, j) table of slide 16, from D(5, 4) = 3 back to D(0, 0))

20 The solutions - #2
      1 + 0 + 1 + 0 + 1 + 0 = 3
      D   M   I   M   D   M
      M   A   -   T   H   S
      -   A   R   T   -   S

21 The traceback (through the completed D(i, j) table of slide 16, from D(5, 4) = 3 back to D(0, 0))

22 The solutions - #3
      1 + 1 + 0 + 1 + 0 = 3
      R   R   M   D   M
      M   A   T   H   S
      A   R   T   -   S
“Life must be lived forwards and understood backwards.” - Søren Kierkegaard
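The traceback that produces these solutions can be sketched in Python as below. The sketch recomputes the table with unit costs, then walks back from D(n, m) to D(0, 0), following every predecessor that attains the minimum, so it lists all optimal edit transcripts rather than a single one. The function name and the recursive formulation are illustrative choices, not taken from the slides.

```python
def optimal_transcripts(seq1, seq2):
    """Return all minimum-cost edit transcripts turning seq1 into seq2."""
    n, m = len(seq1), len(seq2)
    # Forward pass: fill the edit distance table with unit costs.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            D[i][j] = min(1 + D[i - 1][j], 1 + D[i][j - 1], t + D[i - 1][j - 1])

    # Backward pass: follow every cell that could have produced D[i][j].
    def walk(i, j):
        if i == 0 and j == 0:
            yield ""
            return
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            for prefix in walk(i - 1, j):
                yield prefix + "D"
        if j > 0 and D[i][j] == D[i][j - 1] + 1:
            for prefix in walk(i, j - 1):
                yield prefix + "I"
        if i > 0 and j > 0:
            t = 0 if seq1[i - 1] == seq2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:
                for prefix in walk(i - 1, j - 1):
                    yield prefix + ("M" if t == 0 else "R")

    return sorted(walk(n, m))
```

For MATHS and ARTS, the returned list includes DMRRM, DMIMDM and RRMDM, the three transcripts shown in slides 18, 20 and 22.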

23 BLOSUM62 SCORING MATRIX

134 LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI
    |     |||         |       |        ||||||    |    || ||
137 LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM

D:D = +6
D:R = -2
From Henikoff 1996
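The identity bars of this alignment can be recomputed from the two sequences. The Python sketch below does only that; it does not score the alignment with BLOSUM62, because the slide gives just two entries of the matrix. It assumes the alignment is ungapped and that the trailing residues LI and VM belong to the two sequences. The constant and function names are illustrative.

```python
# The two aligned segments from the slide (assumed ungapped, trailing
# residues LI / VM included).
SEQ_A = "LQQGELDLVMTSDILPRSELHYSPMFDFEVRLVLAPDHPLASKTQITPEDLASETLLI"
SEQ_B = "LDSNSVDLVLMGVPPRNVEVEAEAFMDNPLVVIAPPDHPLAGERAISLARLAEETFVM"


def match_line(a, b):
    """Return a line with '|' under identical residues and ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return "".join("|" if x == y else " " for x, y in zip(a, b))


def identity_count(a, b):
    """Number of aligned positions that carry the same residue."""
    return match_line(a, b).count("|")
```

Here `identity_count(SEQ_A, SEQ_B)` is 17. The D:D pairs are among the identities, while the D:R pair at the 50th aligned position is one of the mismatches that a substitution matrix scores negatively.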

24 Scoring Matrices
- Physical/chemical similarities: comparing two sequences according to the properties of their residues may highlight regions of structural similarity.
- Identity matrices: because only identities in the alignment are scored, stretches of sequence that have diverged do not penalise any remaining common features.

25 Scoring Matrices (ctd)
- As the direct source of residue-by-residue comparison scores, the scoring matrix you choose has a major impact on the alignment calculated.
- The most commonly used matrices are the mutation matrices PAM and BLOSUM.
- The matrix that performs best is the one that reflects the evolutionary separation of the sequences being aligned.

