Bioinformatics I, Fall 2002. Dynamic programming algorithm: pairwise comparisons.


1 Dynamic programming algorithm: pairwise comparisons

2
- Need a method that is both reliable and efficient to compare two sequences
- Exhaustive comparison of every possible alignment will give good answers but takes too much time

3 Dynamic programming: strategy
- Break the alignment problem into small pieces
- Optimize the first piece
- Then extend into the second piece; since the first piece is already optimized, the program only needs to optimize the extension
- Continue until the end of the comparison

4 Gaps
- Remember we said we need to penalize gaps (mimicking evolution)
- Simplest gap scoring: assign the same penalty (d) to every gap space; this is not very realistic
- More advanced gap scoring: assign a larger penalty (d) to the first space of a gap and a smaller penalty (e) to the following spaces of the same gap (affine scoring)
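The two gap-scoring schemes can be compared with a small sketch. The function names and the affine values d = 12, e = 2 are illustrative assumptions, not values from the slides; only the linear penalty of 8 per space comes from the example used later.

```python
def linear_gap_cost(length, d=8):
    """Total penalty for a gap of `length` spaces when every space costs d."""
    return d * length


def affine_gap_cost(length, d=12, e=2):
    """Total penalty when the first space costs d and each later space costs e.

    d=12 and e=2 are assumed values chosen only for illustration.
    """
    if length == 0:
        return 0
    return d + (length - 1) * e


# Compare the two schemes for gaps of increasing length.
for g in (1, 2, 5):
    print(g, linear_gap_cost(g), affine_gap_cost(g))
```

With these assumed values a single long gap becomes cheaper under affine scoring than under linear scoring, which is the behavior the affine scheme is meant to capture.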

5 Global alignment: Needleman-Wunsch
What you need to start:
- Matrix of sequences to be aligned (example: sequence example from text)
- Substitution matrix; choose one that makes sense (example: BLOSUM50)
- Gap penalty (example: -8 per gap space, that is, d = 8)
- Start at 0 (top left); this allows a "gap" at the beginning of the alignment

6 Dynamic programming process
Fill in the matrix starting from the top left. Each time you move off the diagonal (across or down), you add the gap penalty to the score in the cell you started from; each time you move along a diagonal, you add the score from the substitution matrix.

7 Fill in the values for "gaps" at the beginning (start with 0)
          H    E    A    G
     0   -8  -16  -24  -32
P
A
W
H

8 For example, if you aligned the H with an empty space, you would get a score of -8 for that space in this example:
HEAG
-PAWH
The arrow indicates adding the score from 0.

9 If you aligned both H and E with an empty space, you would get a score of -16 in the E space, because you add the gap penalty onto the score in the preceding space (you did not move diagonally); the arrow comes from -8.
HEAG
--PAWH

10 Similar reasoning allows you to fill in the first column
          H    E    A    G
     0   -8  -16  -24  -32
P   -8
A  -16
W  -24
H  -32
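The border of the matrix can be filled in with a few lines of code. This is a minimal sketch, assuming the layout of the slides (first sequence across the top, second sequence down the side) and the example gap penalty of 8; `init_matrix` is a hypothetical helper name.

```python
def init_matrix(x, y, d=8):
    """Return a (len(y)+1) x (len(x)+1) matrix with the gap-only border filled in.

    x runs across the top (columns), y runs down the side (rows).
    Interior cells are left as None until they are computed.
    """
    F = [[None] * (len(x) + 1) for _ in range(len(y) + 1)]
    for i in range(len(x) + 1):
        F[0][i] = -d * i  # first row: x[:i] aligned entirely with gaps
    for j in range(len(y) + 1):
        F[j][0] = -d * j  # first column: y[:j] aligned entirely with gaps
    return F


F = init_matrix("HEAG", "PAWH")
print(F[0])                   # first row
print([row[0] for row in F])  # first column
```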

11 Now, there are 3 possibilities to fill each remaining matrix element. If you align P with H, you move from 0 along the diagonal, so you add the substitution matrix value of -2.
          H    E    A    G
     0   -8  -16  -24  -32
P   -8
A  -16
W  -24
H  -32
Candidate score for the P/H cell: 0 + (-2) = -2

12 Or, you could start with H aligned with a gap, and then align P with a gap:
H-
-P
          H    E    A    G
     0   -8  -16  -24  -32
P   -8
A  -16
W  -24
H  -32
Candidate score for the P/H cell: -16

13 Or, you could start with P aligned with a gap, and then align H with a gap:
-H
P-
          H    E    A    G
     0   -8  -16  -24  -32
P   -8
A  -16
W  -24
H  -32
Candidate score for the P/H cell: -16

14 We choose the highest value, and preserve it along with the information about where we started to get there (the arrow)
          H    E    A    G
     0   -8  -16  -24  -32
P   -8   -2
A  -16
W  -24
H  -32
The diagonal move gives -2; each of the two gap moves gives -16.

15 Now we get to the P/E matrix element. There are 3 ways we could get to this position:
HE..     HE..     HE-..
-P..     P-..     --P..

16
- Note that only one of these possibilities actually aligns P with E; that is the one that moves diagonally
- These possibilities have different scores; we enter the highest score, and draw an arrow to the matrix element from which we moved to get this score

17
HE..
-P..     Score = -8 + -1 = -9

HE..
P-..     Score = -2 + -8 = -10

HE-.
--P.     Score = -16 + -8 = -24

18
          H    E    A    G
     0   -8  -16  -24  -32
P   -8   -2   -9
A  -16
W  -24
H  -32
In this case, the highest score from the three parent matrix elements was along the diagonal
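The choice among the three parent cells can be written as one small function. This is a sketch only: `cell_score` and its direction labels are hypothetical names, and the two substitution scores used below (-2 for P/H and -1 for P/E) are the ones quoted in the slides.

```python
def cell_score(diag, left, up, s, d=8):
    """Return (best score, direction) for one cell from its three parents."""
    candidates = {
        "diag": diag + s,  # align the two residues
        "left": left - d,  # move across: residue on top aligned with a gap
        "up": up - d,      # move down: residue on the side aligned with a gap
    }
    best = max(candidates, key=candidates.get)
    return candidates[best], best


# P/H cell: parents are 0 (diagonal), -8 (left), -8 (above); s(P,H) = -2
print(cell_score(0, -8, -8, s=-2))    # (-2, 'diag')
# P/E cell: parents are -8 (diagonal), -2 (left), -16 (above); s(P,E) = -1
print(cell_score(-8, -2, -16, s=-1))  # (-9, 'diag')
```

The returned direction plays the role of the arrow drawn in the matrix.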

19
- Using the same logic, you can fill in all the other cells in the matrix
- We can also express this process using matrix notation
- x and y are sequences: $x_{1 \dots i}$, $y_{1 \dots j}$
- Matrix F: $F(i,j)$ is the score of the best alignment between the initial part of x (up to $x_i$) and the initial part of y (up to $y_j$)

20
- Remember, the strategy is to optimize the first bits and then extend; so we are looking for the best score of $F(i,j)$, which can come from extending from $F(i-1,j-1)$ (diagonal), from $F(i-1,j)$ (across) or from $F(i,j-1)$ (down)
- Since we started at the beginning of the sequences, this process takes account of all possible alignments, giving us the best one

21 We can express this by:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
where s = score from substitution matrix and d = linear gap penalty
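The recurrence translates almost directly into code. The sketch below is one possible rendering, not a reference implementation: `nw_fill` and `identity` are hypothetical names, and the scoring function, the gap penalty and the two short test sequences are assumptions chosen for illustration. The matrix is stored as in the slides, with x across the top, so F(i, j) in the formula is `F[j][i]` in the code.

```python
def nw_fill(x, y, score, d):
    """Fill the Needleman-Wunsch matrix for x (columns) and y (rows).

    F[j][i] holds the best score for aligning x[:i] with y[:j].
    """
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, n + 1):
        F[0][i] = -d * i
    for j in range(1, m + 1):
        F[j][0] = -d * j
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            F[j][i] = max(
                F[j - 1][i - 1] + score(x[i - 1], y[j - 1]),  # F(i-1, j-1) + s
                F[j][i - 1] - d,                              # F(i-1, j) - d
                F[j - 1][i] - d,                              # F(i, j-1) - d
            )
    return F


def identity(a, b):
    """Assumed scoring: 1 for identical characters, 0 otherwise."""
    return 1 if a == b else 0


F = nw_fill("ACGT", "AGT", identity, d=1)
for row in F:
    print(row)
print("global score:", F[-1][-1])
```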

22 So now what?
So now, we look for the path through the matrix that gives the final score; in this kind of global alignment, the last cell of the matrix is by definition the best score for the alignment. Looking for the path is called traceback: you follow the pointers that got you to the end (like Hansel and Gretel ...)

23
- By following the arrows, you can arrive at the alignment
- Only one alignment is found in this treatment, but the algorithm can be modified to recover more than one
- See the example from the text
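Traceback can be sketched by storing one pointer per cell while filling the matrix and then walking back from the last cell. The names `needleman_wunsch` and `identity`, the scoring function and the test sequences are assumptions made for this example; ties are broken in favor of the diagonal, so only one of possibly several optimal alignments is returned.

```python
def needleman_wunsch(x, y, score, d):
    """Global alignment with linear gap penalty d; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, n + 1):
        F[0][i], ptr[0][i] = -d * i, "left"
    for j in range(1, m + 1):
        F[j][0], ptr[j][0] = -d * j, "up"
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            options = [
                (F[j - 1][i - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[j][i - 1] - d, "left"),
                (F[j - 1][i] - d, "up"),
            ]
            # Keep the best score and remember which parent it came from.
            F[j][i], ptr[j][i] = max(options, key=lambda t: t[0])

    # Traceback: follow the pointers from the last cell back to the origin.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[j][i]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "left":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


def identity(a, b):
    """Assumed scoring: 1 for identical characters, 0 otherwise."""
    return 1 if a == b else 0


total, ax, ay = needleman_wunsch("ACGT", "AGT", identity, d=1)
print(total)
print(ax)
print(ay)
```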

24 In-class exercise II
- Using identity scoring and a gap penalty d = 1 (consider spaces before and after ends of sequences to be gaps), complete the matrix on the following slide
- Do a traceback to find the optimal alignment

25 In-class exercise II: complete the matrix
          G    A    A    C    T    T    A
     0
A
C
C
T
T
T

26 In-class exercise III
- Use the Gap program to align sequences in the nosalign file
- Vary the gap initiation penalty and the gap extension penalty; compare alignments
- Change the substitution matrix, keeping all other variables the same; compare alignments
- Use the Gap program to align two unrelated sequences

27 Instructions for Gap exercise
- In seqlab, bioinfI.list, select nosalign; get into Editor
- Select 2 sequences; use the info button if necessary to find out what these sequences are; select Edit > Remove gaps > All gaps
- Select Functions > Pairwise Comparison > Gap
- Select Options; select penalize end gaps like other gaps, then Close, then Run

28
- Note the quality score of this alignment
- Now systematically vary the gap penalties and the substitution matrices and run the program (always penalizing end gaps) on the same pair of sequences; note the quality scores for each variation
- See what happens if you don't penalize end gaps
- Don't save this as anything, just go to the main list when you are done

29
- Go back to the main list; select unrelated; use the info button to find out what these sequences are
- Run the Gap program (penalizing end gaps)
- Is this alignment meaningful? Check by using the Generate statistics from randomized alignments feature in options; choose preserving nucleotide or amino acid composition and take the other defaults; note the average score and standard deviation from the randomizations and compare them to the score of the alignment
- See what happens when you don't penalize end gaps
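The idea behind the randomization check can be sketched outside the Gap program. The code below is not how Gap computes its statistics; it is an assumed, simplified version that shuffles one sequence (which preserves its composition), realigns, and reports the mean and standard deviation of the shuffled scores. All names, the scoring function, the sequences and the number of trials are hypothetical.

```python
import random
import statistics


def nw_score(x, y, score, d):
    """Global alignment score with linear gap penalty d (two-row version)."""
    prev = [-d * i for i in range(len(x) + 1)]
    for j in range(1, len(y) + 1):
        cur = [-d * j]
        for i in range(1, len(x) + 1):
            cur.append(max(prev[i - 1] + score(x[i - 1], y[j - 1]),
                           cur[i - 1] - d,
                           prev[i] - d))
        prev = cur
    return prev[-1]


def shuffle_stats(x, y, score, d, trials=100, seed=0):
    """Mean and standard deviation of scores after shuffling y `trials` times."""
    rng = random.Random(seed)
    letters = list(y)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)  # same composition, random order
        scores.append(nw_score(x, "".join(letters), score, d))
    return statistics.mean(scores), statistics.stdev(scores)


def identity(a, b):
    """Assumed scoring: 1 for identical characters, 0 otherwise."""
    return 1 if a == b else 0


x, y = "GAACTTAGGACT", "ACCTTTGGACT"
observed = nw_score(x, y, identity, d=1)
mean, sd = shuffle_stats(x, y, identity, d=1)
print("observed:", observed, "shuffled mean:", mean, "sd:", sd)
```

An observed score that sits several standard deviations above the shuffled mean suggests the alignment is not just a composition effect; a score close to the mean suggests it is not meaningful.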

30 Local alignment: Smith-Waterman
This is very similar to Needleman-Wunsch, with two major differences:
- Must allow for starting a new alignment rather than extending one
- Must allow for the alignment to end before the end of the sequences

31 Allowing for starting a new alignment is done by allowing $F(i,j)$ to take the value 0 if all other options are < 0:
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$

32 Allowing for the alignment to end before the end of the sequence is taken care of by looking for the highest score in the matrix, and starting the traceback from there until a 0 is reached.
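Both changes (the extra 0 option and the traceback from the highest cell) can be sketched as follows. The function names, the match/mismatch scores, the gap penalty and the test sequences are assumptions for illustration; note that local alignment needs negative mismatch scores to be useful, so identity scoring is not used here.

```python
def smith_waterman(x, y, score, d):
    """Local alignment with linear gap penalty d; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            options = [
                (0, None),  # start a new alignment here
                (F[j - 1][i - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[j][i - 1] - d, "left"),
                (F[j - 1][i] - d, "up"),
            ]
            F[j][i], ptr[j][i] = max(options, key=lambda t: t[0])
            if F[j][i] > best:
                best, best_pos = F[j][i], (i, j)

    # Traceback from the highest-scoring cell until a 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while F[j][i] > 0:
        move = ptr[j][i]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "left":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


def match_mismatch(a, b):
    """Assumed scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


print(smith_waterman("TTTACGTTT", "GGACGGG", match_mismatch, d=2))
```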

33 In-class exercise IV
- Use Bestfit to find local alignments for the same sequences (from nosalign and from unrelated) you used in the previous exercises; note that you do not have an option about penalizing end gaps as you did in Gap
- Vary the same parameters you did before
- Use randomizations to evaluate alignments

34 Affine gap penalties
- To distinguish between a gap initiation (d) and a gap extension (e) penalty, we have to distinguish between a sequence element aligned with another sequence element, and one aligned to a gap
- There are two ways to be aligned to a gap: $x_i$ aligned to a gap in y, or $y_j$ aligned to a gap in x

35
- In building our F matrix, remember we want to maximize the score of the extension, so we have to take into account the three parent possibilities.
- We simply define 3 cases and how to extend them; so the algorithm extends according to which case it starts from.
- $M(i,j)$ = best score for two sequence characters aligned
- $I_x(i,j)$ = best score for $x_i$ aligned with a gap in y
- $I_y(i,j)$ = best score for $y_j$ aligned with a gap in x

36
$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i, y_j) \\ I_x(i-1,j-1) + s(x_i, y_j) \\ I_y(i-1,j-1) + s(x_i, y_j) \end{cases}$$
$$I_x(i,j) = \max \begin{cases} M(i-1,j) - d \\ I_x(i-1,j) - e \end{cases}$$
$$I_y(i,j) = \max \begin{cases} M(i,j-1) - d \\ I_y(i,j-1) - e \end{cases}$$
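The three-matrix scheme can be sketched for the global case as below. This is an assumed rendering that computes only the score (no traceback); here the arrays are indexed `[i][j]` exactly as in the formulas, with i running over x and j over y. The function names, the border initialization, the scoring function and the values of d and e are assumptions made for the example.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, score, d, e):
    """Best global alignment score with gap-open penalty d and gap-extend penalty e."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e  # leading gap in y: one open, then extensions
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e  # leading gap in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(x[i - 1], y[j - 1])
            # x_i aligned with y_j, coming from any of the three states
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # x_i aligned with a gap in y: open a gap or extend one
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            # y_j aligned with a gap in x: open a gap or extend one
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(M[n][m], Ix[n][m], Iy[n][m])


def match_mismatch(a, b):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(affine_global_score("ACGGGT", "ACT", match_mismatch, d=3, e=1))
```

Setting e equal to d reduces this to the linear gap penalty used earlier, which is a convenient sanity check.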

