In Bioinformatics use a computational method - Dynamic Programming.


1 In bioinformatics, a computational method called dynamic programming is used to align two proteins or nucleic acids. The term dynamic programming describes the process of solving problems in which one needs to find the best decisions one after another. First we select the best path from Start to A, then we select the best path from A to Finish. The choice of the best path from A to Finish is independent of the choice of path from Start to A.

2 Thus the path is subdivided into a set of steps.
The goal is to find the optimal way for each step. Any segment of the true optimal path must itself be an optimal path between its endpoints. This is the main idea of the dynamic programming method. Dynamic programming is typically used when a problem has many possible solutions and an optimal one needs to be found.

3 sequence 1: S  D  V  –  Y
sequence 2: S  R  V  L  Y
pair score: 2 -1  2 -2  2
Sum of residue pair scores minus gap penalty (gap penalty = -2): total score = 3
Score of new alignment = score of previous alignment + score of new aligned pair
sequence 1: S D V – Y T
sequence 2: S R V L Y T
Score =
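The bookkeeping on this slide can be sketched in Python. The pair scores and the gap penalty are copied from the slide; the function name, the dictionary layout and the use of "-" for a gap are illustrative assumptions.

```python
# Pair scores copied from the slide; the dictionary layout is an assumption.
PAIR_SCORES = {("S", "S"): 2, ("D", "R"): -1, ("V", "V"): 2, ("Y", "Y"): 2}
GAP_PENALTY = -2  # score of a residue aligned with a gap, as on the slide


def alignment_score(seq1, seq2):
    """Sum the column scores of two equal-length aligned sequences."""
    total = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            total += GAP_PENALTY
        else:
            total += PAIR_SCORES[(a, b)]
    return total


# Score of the full alignment from the slide.
print(alignment_score("SDV-Y", "SRVLY"))  # 3

# Same result built incrementally: previous alignment plus the new pair.
previous = alignment_score("SDV-", "SRVL")
print(previous + PAIR_SCORES[("Y", "Y")])  # 3
```

The second call illustrates why dynamic programming works here: the score of a longer alignment only needs the score of the shorter one plus the score of the newly aligned pair.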

4 There are two sequences: A = ACGCTG, B = CATGT. What is the best alignment?
Question: explain the cell in the first row and the first column.

5 – A C G... – C A T...

6 Question: How do we estimate the gap penalty?

7 Question: How do we calculate the score of this alignment?

8 How do we calculate the scores?

9 Question: How do we estimate the mismatch score? 0, -1, 1?

10 Question: How do we estimate the match score? 0, 1, 2? Thus in this alignment the penalty for a gap is…, the score for a mismatch is…

11 Explain the score in the cell G3/C1.
Check the score for a mismatch against the previous slides.

12 Check the score in the cell G3/A2

13 After filling in all of the values the score matrix is as follows:
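A minimal sketch of how such a score matrix can be filled in, assuming the standard Needleman-Wunsch recurrence. The slides leave the match, mismatch and gap values as questions, so the defaults below (1, -1, -1) are assumptions and the resulting numbers need not agree with the matrix shown on the slide. Sequence 1 runs along the columns and sequence 2 along the rows.

```python
def fill_score_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill a global alignment score matrix.

    score[i][j] is the best score for aligning the first i residues of
    seq2 (rows) with the first j residues of seq1 (columns).
    The scoring values are assumptions, not taken from the slides.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and first column: a prefix aligned against gaps only.
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        score[i][0] = i * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
            diagonal = score[i - 1][j - 1] + pair  # match or mismatch
            left = score[i][j - 1] + gap           # gap in sequence 2
            above = score[i - 1][j] + gap          # gap in sequence 1
            score[i][j] = max(diagonal, left, above)
    return score


matrix = fill_score_matrix("ACGCTG", "CATGT")
for row in matrix:
    print(row)
```

The first row and column contain multiples of the gap penalty because the only way to align a prefix with nothing is to open gaps against every residue.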

14 The next procedure is the traceback step.
The traceback step determines the actual alignment that results in the maximum score. It begins at position (N, M) of the matrix, i.e. the last cell, where both sequences have been aligned globally over their full lengths.

15 The algorithm of the traceback: a) the step begins with the last cell (G6/T5 in this case).
Traceback takes the current cell and looks at the neighbor cells that could be direct predecessors: the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1).

16 For the current cell there are two possible predecessors with the maximum score 3.
b) If more than one possible predecessor (left and above) with the same maximum score exists, any can be chosen. If the diagonal neighbor has the same maximum score, the diagonal way is selected to avoid a gap. Variant 1: select the left cell as the predecessor.
…TG
…T-
Select the best alignment and compare it with the alignment on the next slide.
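The traceback can be sketched as follows. This is a self-contained illustration under assumed scores (match 1, mismatch -1, gap -1), not the exact procedure or values of the slides. It follows the move that actually produced each cell's score, which is the usual formulation, and on ties it prefers the diagonal and then the left neighbor, in the spirit of the rule above.

```python
def global_align(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment with traceback; scoring values are assumptions.

    seq1 runs along the columns and seq2 along the rows, so a move to the
    left puts a gap in seq2 and a move upward puts a gap in seq1.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    score = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        score[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i][j - 1] + gap,
                              score[i - 1][j] + gap)

    # Traceback: start in the last cell and walk back to the origin.
    out1, out2 = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                out1.append(seq1[j - 1])   # diagonal: aligned pair
                out2.append(seq2[i - 1])
                i, j = i - 1, j - 1
                continue
        if j > 0 and score[i][j] == score[i][j - 1] + gap:
            out1.append(seq1[j - 1])       # left: gap in sequence 2
            out2.append("-")
            j -= 1
        else:
            out1.append("-")               # above: gap in sequence 1
            out2.append(seq2[i - 1])
            i -= 1
    return "".join(reversed(out1)), "".join(reversed(out2)), score[-1][-1]


aligned1, aligned2, best = global_align("ACGCTG", "CATGT")
print(aligned1)
print(aligned2)
print(best)
```

Changing the order of the tie-breaking checks yields other alignments with the same maximum score, which is how Variant 1 and Variant 2 in the slides arise.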

17 Question: Does your alignment coincide with this one?
Make another possible alignment (Variant 2) and then compare it with the alignment on the next slide.

18 Variant 2 Question: What are the maximum scores of these two possible alignments?

19 Multiple sequence alignment (MSA)
Multiple sequence alignments are a much greater challenge for bioinformaticians than pairwise alignments, because the cost of exact dynamic programming grows exponentially with the number of sequences. Thus, heuristic approaches are needed to make it feasible to align multiple sequences.

20 What to align? Proteins, nucleotides, codons
An important consideration is the choice of sequence type to align. Protein alignments contain the most information, owing to the alphabet size of 20. Nucleotide alignments are notoriously difficult at larger evolutionary distances. Codon alignments are a special case with high information content. They reveal themselves by a three-nucleotide periodicity in gap sizes and in the positions of non-conserved residues (codon wobble).

21 When are sequences similar?
Apart from sequence similarity, it depends on:
Nucleotide / protein
Nucleotide: 4 different residues → more likely to be similar by chance
Proteins: 20 different residues → less likely to be similar by chance
Sequence length
Short sequences → similarity by chance
The question that is raised now is of course: when is there homology? If it is an inference made by the researcher, when can this inference be made? There is no standard answer to this. A variety of factors contribute. The degree of similarity is the first important factor, and the degree of similarity necessary to infer homology depends on the type of sequences compared. Since nucleotides only have four residues, they are much more likely to be similar by chance. Because proteins have 20 residues, they are much less likely to be similar by chance. So the requirements regarding similarity are much higher for nucleotides. Sequence length is also important. Short sequences have a higher probability of being similar by chance, so the required degree of similarity is much greater for short sequences.

22 Similarity vs. identity
Identity: the same residue
Similarity: similar physicochemical characteristics (can more readily be substituted)
Until now, we have talked about similarity without really defining it. The easiest way to define similarity is by contrasting it with identity. Identity refers to 100% correspondence between residues, e.g. a T is identical to a T. If substitutions of all residues happened at an equal rate, there would be no need for this distinction between identity and similarity. But since some residues are more readily substituted by others, two residues can be similar without being identical. For proteins, if two amino acids have the same physicochemical characteristics, you will find more substitutions between these two amino acids. For nucleotides, a pyrimidine is more likely to be substituted by another pyrimidine than by a purine (related to synonymous substitutions in coding regions).

23 Algorithms - overview
Goal: comparing conserved regions
Two methods: global, local
Three techniques: dot plots, dynamic programming, word-based
The previous concepts apply to all kinds of sequence comparisons, such as alignments and dot plots. Since sequence comparisons are almost always performed to compare conserved regions in a set of sequences, sequence alignment is the most widely used approach. There are two overall approaches to alignment: local and global. Furthermore, there are three techniques for doing the actual comparison. These are described next.

24 Local vs. global alignments
sequence 1: ACTCCGTAGGTTGGACTCC
sequence 2: CTCTGGTAGGCTTACTCTG
[Figure panels: global alignment, local alignment]
When comparing sequences, you can either compare the sequences over their entire lengths, or you can compare the regions that are most conserved. "Global" refers to the first case, "local" to the second case. In this example, two sequences are compared using first a global approach and subsequently a local approach. Local alignments are preferable for sequence comparisons where you expect large regions that are not conserved and you wish to investigate the regions of high similarity. Global alignments are best if you wish to examine the overall similarity of sequences.
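To make the contrast concrete, here is a minimal sketch of a local alignment score in the style of Smith-Waterman, a technique the slides do not name explicitly. The only changes relative to the global recurrence are that a cell may never drop below 0 and that the best cell anywhere in the matrix is reported. The scoring values are assumptions.

```python
def best_local_score(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Best local alignment score; scoring values are assumptions."""
    best = 0
    previous = [0] * (len(seq1) + 1)  # row above the current one
    for residue2 in seq2:
        current = [0]
        for j, residue1 in enumerate(seq1, start=1):
            pair = match if residue1 == residue2 else mismatch
            cell = max(0,                       # restart the alignment here
                       previous[j - 1] + pair,  # diagonal
                       previous[j] + gap,       # above
                       current[j - 1] + gap)    # left
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# Only the shared GGG block contributes to the local score.
print(best_local_score("AAAGGG", "TTTGGGCCC"))  # 3
```

Unconserved flanking regions do not lower the score because a cell can always restart at 0, which is why local alignment suits sequences with large non-conserved regions.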

25 Agenda Sum of Pairs method ClustalW Gap penalties

26 Sum of Pairs (SP) method
Consider aligning the following 4 protein sequences:
S1 = AQPILLLV
S2 = ALRLL
S3 = AKILLL
S4 = CPPVLILV
The problem is decomposed using the sum of pairs method. Which of these two MSAs should we choose?
MSA 1:
A Q P I L L L V
A L R - L L - -
A K - I L L L -
C P P V L I L V
MSA 2:
A Q P I L L L V
A - - L R L L -
- A K I L L L -
C P P V L I L V

27 Sum of Pairs cont.
Assume: c(match) = 1, c(mismatch) = -1, c(gap) = -2, c(-, -) = 0.
A Q P I L L L V
A L R - L L - -
A K - I L L L -
C P P V L I L V
Then the SP score for the 4th column of the MSA is
SP(column4) = SP(I,-,I,V) = c(I,-) + c(I,I) + c(I,V) + c(-,I) + c(-,V) + c(I,V) = (-2) + 1 + (-1) + (-2) + (-2) + (-1) = -7
In other words, given the costs match = 1, mismatch = -1, gap = -2 and gap over gap = 0, the SP score of a column is calculated by summing all pairwise scores in that column.
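The column score can be checked with a few lines of Python. The costs are the ones assumed on the slide; the function names are illustrative.

```python
from itertools import combinations


def pair_cost(x, y, match=1, mismatch=-1, gap=-2):
    """Cost of one residue pair, using the costs assumed on the slide."""
    if x == "-" and y == "-":
        return 0  # gap over gap
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_column(column):
    """Sum of pairs score of one alignment column."""
    return sum(pair_cost(x, y) for x, y in combinations(column, 2))


print(sp_column("I-IV"))  # -7
```

`combinations(column, 2)` enumerates each unordered pair of rows exactly once, which gives the six terms of the sum for a column of four residues.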

28 Sum of Pairs cont. To find SP(MSA), we find the score of each column m_i and then sum all SP(m_i) scores to get the score of the MSA. To find the optimal score using this method, we need to consider all possible MSAs.
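As a sketch, the whole-alignment score can be computed by summing the column scores. The costs are those assumed in the slides; the row strings are a reading of the two candidate alignments from the earlier slide and, like the function names, should be treated as assumptions.

```python
from itertools import combinations


def pair_cost(x, y, match=1, mismatch=-1, gap=-2):
    """Cost of one residue pair, using the costs assumed in the slides."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_score(msa):
    """Sum of pairs score of an MSA given as equal-length row strings."""
    total = 0
    for column in zip(*msa):  # walk the alignment column by column
        total += sum(pair_cost(x, y) for x, y in combinations(column, 2))
    return total


msa_1 = ["AQPILLLV", "ALR-LL--", "AK-ILLL-", "CPPVLILV"]
msa_2 = ["AQPILLLV", "A--LRLL-", "-AKILLL-", "CPPVLILV"]
print(sp_score(msa_1))  # -24
print(sp_score(msa_2))  # -28
```

Under these costs the first candidate scores higher. Scoring one MSA is cheap; the expensive part is the number of candidate MSAs that would have to be scored to find the optimum.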

29 The ClustalW method
ClustalW is a progressive method for MSA. One trick for avoiding the search over all possible MSAs is implemented in the CLUSTAL method. "Progressive" means that the algorithm starts by determining the most related sequences pairwise and then progressively adds less related sequences or groups of sequences to the initial alignment.

30 ClustalW steps
All pairwise alignments → similarity matrix → cluster analysis → dendrogram → multiple alignment
Multiple alignment steps: aligning S1 and S3; aligning S2 and S4; aligning (S1, S3) with (S2, S4)
Briefly, the Clustal method calculates all possible pairwise alignment 'distances' between the input sequences. The scores are stored in a similarity matrix, which can be used to cluster the sequences in a dendrogram. The dendrogram is then used as a 'guide tree', telling the algorithm in which order to align the sequences. From Higgins (1991) and Thompson (1994).

31 ClustalW Step 1
Use a pairwise alignment method to compute all pairwise alignments among the sequences. Then consider only the non-gapped positions: count the number of mismatches between the two sequences and divide by the number of non-gapped pairs to obtain the distance.
NKL-ON
-MLNON
distance = 1/4 = 0.25
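The distance described on this slide can be written as a short function. The function name is illustrative, and raising an error when no non-gapped pair exists is an assumption about how to treat that edge case.

```python
def mismatch_distance(aligned1, aligned2):
    """Fraction of mismatches among the non-gapped aligned positions."""
    pairs = [(x, y) for x, y in zip(aligned1, aligned2)
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no non-gapped positions to compare")
    mismatches = sum(1 for x, y in pairs if x != y)
    return mismatches / len(pairs)


print(mismatch_distance("NKL-ON", "-MLNON"))  # 0.25
```

In the slide's example, four positions have residues in both sequences (K/M, L/L, O/O, N/N) and one of them is a mismatch, hence 1/4.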

32 ClustalW Step 1 continued
[Table: distance matrix with rows and columns S1, S2, S3, S4; the matrix is symmetric.]
In this way all pairwise distances are stored in a distance matrix.

33 ClustalW Step 2
Construct a similarity tree (guide tree). From the distance matrix, a clustering algorithm is used to produce a rooted similarity tree. The root is placed at the midpoint of the longest chain of consecutive edges.
[Figure: guide tree with leaves S1, S2, S3, S4]
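A minimal sketch of how a distance matrix can be turned into a rooted tree. It uses UPGMA (average-linkage clustering), which is a swapped-in, simpler technique than the neighbor-joining with midpoint rooting used by ClustalW, and the distance values are made up for illustration.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a rooted tree as nested tuples by average-linkage clustering.

    `distances` maps pairs of names to distances. This is UPGMA, used here
    as a simple stand-in for ClustalW's own tree-building method.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(sizes) > 1:
        # Merge the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to the merged cluster: size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical distances, chosen so that S1/S2 and S3/S4 pair up.
example = {
    ("S1", "S2"): 0.2, ("S1", "S3"): 0.6, ("S1", "S4"): 0.6,
    ("S2", "S3"): 0.6, ("S2", "S4"): 0.6, ("S3", "S4"): 0.3,
}
print(upgma(["S1", "S2", "S3", "S4"], example))
# (('S1', 'S2'), ('S3', 'S4'))
```

The nested tuples can be read from the inside out as the order in which sequences and groups would be aligned.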

34 ClustalW Step 3
Combine alignments: the rooted guide tree is used to guide the alignment of the sequences. The alignments are combined starting from the most closely related groups and ending with the most distantly related groups, going from the tips of the tree to the root. In our example, we first align S1 with S2 (grp1), then S3 with S4 (grp2), then grp1 with grp2, and continue until the root is reached. Each alignment (sequence-sequence, sequence-profile, profile-profile) involves dynamic programming with the SP score method. As a result, the complexity is reduced to that of a series of pairwise alignments.

35 Distance between sequences - measure from the guide tree - determines which matrix to use
80-100% seq-id -> Blosum80
60-80% seq-id -> Blosum60
30-60% seq-id -> Blosum45
0-30% seq-id -> Blosum30
Which scoring matrix should be used? Even this parameter choice can be automated by considering the observed distance between sequences. This measure from the guide tree determines which matrix to use.
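The lookup can be sketched as a small function. The thresholds and matrix names are copied from the slide; treating each lower bound as inclusive is an assumption, since the slide's ranges overlap at their boundaries.

```python
def choose_matrix(percent_identity):
    """Pick a scoring matrix name from the percent sequence identity.

    Thresholds follow the slide; lower bounds are treated as inclusive,
    which is an assumption.
    """
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= 80:
        return "Blosum80"
    if percent_identity >= 60:
        return "Blosum60"
    if percent_identity >= 30:
        return "Blosum45"
    return "Blosum30"


print(choose_matrix(72))  # Blosum60
```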

36 Gap penalties
Gap opening penalty (GOP), gap extension penalty (GEP)
GTEAIVLMANKL
G KL
Gap penalty: GOP + 8*GEP
A detail of ClustalW is the setting of gap penalties, which have been optimized against structural protein alignments. The gap opening penalty models the cost associated with the location of a gap. The gap extension penalty models the cost associated with the length of a gap. The total gap penalty is a combination of the two penalties.
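A minimal sketch of the combined penalty, following the slide's formula of one opening penalty plus one extension penalty per extended position. The numeric values of GOP and GEP below are hypothetical and are not ClustalW's settings.

```python
def gap_penalty(extensions, gop, gep):
    """Total penalty of one gap: opening cost plus extension cost.

    `extensions` is the multiplier of GEP, i.e. 8 in the slide's example.
    """
    if extensions < 0:
        raise ValueError("number of extensions must be non-negative")
    return gop + extensions * gep


# Hypothetical penalty values, chosen only for illustration.
print(gap_penalty(8, gop=10, gep=1))  # 18
```

Because the opening cost is paid once per gap, one long gap is cheaper than several short gaps of the same total length whenever GOP is larger than GEP.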

37 Modifications of gap penalty
Gap at position -> lower GOP (residue-specific penalties)
Gap within 8 residues? -> increase GOP (gap separation distance)
Hydrophilic residues -> lower GOP (hydrophilic-hydrophobic gap penalties)
Furthermore, gap penalties at a position are influenced by the presence of other gaps in the neighborhood. Along the same lines, the observation that most gaps lie next to hydrophilic residues, which are usually located in the more flexible, less constrained outer parts of the protein tertiary structure, is modeled accordingly.
