DNA, RNA and protein are an alien language


1 DNA, RNA and protein are an alien language
We try to attack this language cryptographically: we want to decipher both its meaning and its history.

2 We do not have to understand the language to identify patterns
Fortunately, the genetic code is alphabetic, so it lends itself to string comparison and pattern recognition. We do not have to understand the language to identify patterns: “klaatu barada nikto” (The Day the Earth Stood Still). Miescher 1892

3 Pairwise Sequence Alignment

4 Pairwise Sequence Alignment
Principles of pairwise sequence comparison
- global / local alignments
- scoring systems
- gap penalties
Methods of pairwise sequence alignment
- window-based methods
- dynamic programming approaches
These two methods are generally used to obtain alignments. They serve as a basis for many other operations in computational biology. For homology searches in databases both methods are combined. We will come to this later in our db search session.

5 Pairwise Sequence Alignment: How to?
[Figure: the two example sequences, A T T C A C A T A and T A C A T T A C G T A C, labelled Sequence 1 and Sequence 2.]

6 Dotplot: A dotplot gives an overview of all possible alignments
[Figure: dotplot of the sequences A T T C A C A T A and T A C A T T A C G T A C, with axes labelled Sequence 1 and Sequence 2.]
In the following I will often use dotplots and alignment matrices to explain alignment algorithms. The dotplot technique: a dotplot allows visual inspection of all possible alignments. The two sequences to be aligned are written out as the column and row headings of a so-called alignment matrix. Note that the vertical sequence is read from bottom to top. A dot is placed in the matrix wherever the symbols of the two sequences are identical.
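The dot-placing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the program used for the slides; the function name `dotplot` and the choice of which sequence goes on which axis are assumptions.

```python
def dotplot(seq1, seq2):
    """Return one text row per symbol of seq2; '*' marks identical symbols."""
    return ["".join("*" if a == b else "." for a in seq1) for b in seq2]

seq1 = "ATTCACATA"       # written across the top (assumed orientation)
seq2 = "TACATTACGTAC"    # written down the side
rows = dotplot(seq1, seq2)

# Print with the vertical sequence read from bottom to top, as on the slide.
print("  " + seq1)
for symbol, row in zip(reversed(seq2), reversed(rows)):
    print(symbol + " " + row)
```

Each run of `*` along a diagonal of the printed matrix corresponds to a stretch of identical symbols in one ungapped alignment.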

7 Dotplot: In a dotplot each diagonal corresponds to a possible (ungapped) alignment
[Figure: dotplot of the sequences A T T C A C A T A and T A C A T T A C G T A C with one diagonal highlighted.]
A dotplot gives an overview of all possible alignments of two sequences. Each diagonal represents one possible alignment.
One possible alignment (from the figure): T A C A T T A C G T A C against A T A C A C T T A.

8 Pairwise Sequence Alignment
Principles of pairwise sequence comparison
- global / local alignments
- scoring systems
- gap penalties
Methods of pairwise sequence alignment
- window-based methods
- dynamic programming approaches
These two methods are generally used to obtain alignments. They serve as a basis for many other operations in computational biology. For homology searches in databases both methods are combined. We will come to this later in our db search session.

9 Window-based Approaches
- Word Size
- Window / Stringency
Window-based approaches are quick methods used for database searches. There are two different approaches:
- word size algorithm, searching for short identities
- window/stringency, searching for short similar regions, without gaps
Neither of the methods uses gap penalties!

10 Word Size Algorithm
[Figure: dotplot of T A C G G T A T G against A C A G T A T C, word size = 3.]
A window with a user-defined word size slides across the aligned sequences. With a word size of 3, a dot is drawn only if three neighbouring nucleotides match. This is a search for short identities over all possible alignments. Note that all items within a word must match and the “word” cannot be split by gaps. This causes problems when comparing protein sequences. The word algorithm is not very sensitive, so it is not suited to detecting weak homologies.
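As a rough sketch of the word size rule, the following Python function reports every pair of start positions at which the two sequences share an identical word. The name `word_dotplot` and the 0-based coordinates are assumptions made for illustration.

```python
def word_dotplot(seq1, seq2, word=3):
    """Return (i, j) pairs where seq1[i:i+word] equals seq2[j:j+word] exactly."""
    hits = set()
    for i in range(len(seq1) - word + 1):
        for j in range(len(seq2) - word + 1):
            if seq1[i:i + word] == seq2[j:j + word]:
                hits.add((i, j))
    return hits

# The two sequences from the slide, word size 3.
hits = word_dotplot("TACGGTATG", "ACAGTATC", word=3)
print(sorted(hits))
```

For these sequences only the words GTA and TAT are shared, so the plot contains two dots on the same diagonal.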

11 Window / Stringency
[Figure: dotplot of T A C G G T A T G against T C A G T A T C, window = 5, stringency = 4.]
The sensitivity problem can be overcome by permitting mismatches within a word: one simply defines a window size and a minimal number of matches within that window. GCG programs call this the stringency. Dotplots generated this way are more sensitive.
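The window/stringency rule differs from the word size rule only in that it counts matches instead of requiring identity. A minimal Python sketch follows; the function name `window_dotplot` and the 0-based coordinates are assumptions, and this is not the GCG implementation.

```python
def window_dotplot(seq1, seq2, window=5, stringency=4):
    """Return (i, j) pairs whose windows share at least `stringency` matches."""
    hits = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(a == b for a, b in
                          zip(seq1[i:i + window], seq2[j:j + window]))
            if matches >= stringency:
                hits.add((i, j))
    return hits

# The two sequences from the slide, window 5 and stringency 4.
hits = window_dotplot("TACGGTATG", "TCAGTATC", window=5, stringency=4)
print(sorted(hits))
```

Setting `stringency` equal to `window` reproduces the word size method, which makes the relationship between the two approaches explicit.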

12 Considerations
- The window/stringency method is more sensitive than the word size method (ambiguities are permitted).
- The smaller the window, the larger the weight of statistical (unspecific) matches.
- With large windows the sensitivity for short sequences is reduced.
- Insertions/deletions are not treated explicitly.

13 Insertions / Deletions in a Dotplot
[Figure: dotplot with Sequence 1 and Sequence 2 on the axes; the diagonal is broken and shifted by one position.]
This alignment contains one gap. In the corresponding dotplot the diagonal of the alignment is drawn up to the gap and then continues shifted by one position.
T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T

14 Dotplot (Window = 30 / Stringency = 9)
[Figure: dotplot of hemoglobin chains; output of the programs Compare and DotPlot.]
With the programs Compare and DotPlot you can create a visual alignment. If you run Compare with the default parameters on very similar sequences, the dotplot becomes very crowded. You can filter these results either by reducing the window size or by increasing the stringency.

15 Dotplot (Window = 18 / Stringency = 10)
[Figure: dotplot of hemoglobin chains; output of the programs Compare and DotPlot.]
Here we changed the window size from 30 to 18 and the stringency from 9 to 10.

16 Pairwise Sequence Alignment
Principles of pairwise sequence comparison
- global / local alignments
- scoring systems
- gap penalties
Methods of pairwise sequence alignment
- window-based approaches
- dynamic programming approaches
  - Needleman and Wunsch
  - Smith and Waterman
Window-based approaches are quick methods for identifying sequence similarities. However, to compute an optimal alignment of two sequences one has to use another approach: dynamic programming.

17 Dynamic Programming
An automatic procedure that finds the best alignment, with an optimal score that depends on the chosen parameters.
- Recursive solutions: we solve smaller problems first and use those solutions to solve larger problems.
- Intermediate solutions are stored in a tabular matrix.
As we have seen in the last section, the GCG program Gap uses the Needleman & Wunsch algorithm to compute a global alignment. The GCG programs Similarity and Bestfit compute local alignments. They use the Smith & Waterman algorithm to identify a region (or regions) of highest similarity. The Needleman & Wunsch algorithm aligns a pair of sequences over their entire lengths, while the Smith & Waterman algorithm finds the best matching regions in the same pair of sequences. Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may share only limited regions of conserved sequence. Very often two sequences share only a single functional domain.

18 Basic principles of dynamic programming
The basic principles of dynamic programming. Basically there are three steps:
- Initialization of the alignment matrix: the scoring model
- Stepwise calculation of score values (creation of an alignment path matrix)
- Backtracking (evaluation of the optimal path)

19 Initialization of Matrix (BLOSUM 50): A distance metric
[Figure: score matrix with H E A G A W G H E E along the top and P A W H E A E down the side.]
The score matrix for the two example sequences, showing the BLOSUM50 value for each aligned residue pair. Positive scores are in bold.
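The score matrix for the example can be rebuilt with a small lookup table. The sketch below hard-codes only the BLOSUM50 entries needed for these two sequences; the table name `B50`, the helper `s`, and the transcription of the values are assumptions of this example, and a real program would load the full matrix from a file or library.

```python
# Subset of BLOSUM50 covering only residue pairs that occur in the example.
# Keys are the two residues in alphabetical order (assumed transcription).
B50 = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PW": -4, "WW": 15,
}

def s(a, b):
    """Substitution score for residues a and b (symmetric lookup)."""
    return B50["".join(sorted(a + b))]

x, y = "HEAGAWGHEE", "PAWHEAE"
grid = [[s(a, b) for a in x] for b in y]   # one row per residue of y

print("   " + "".join(f"{a:>4}" for a in x))
for b, row in zip(y, grid):
    print(f"{b:>3}" + "".join(f"{v:>4}" for v in row))
```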

20 Needleman and Wunsch (global alignment)
Sequence 1: H E A G A W G H E E
Sequence 2: P A W H E A E
Scoring parameters: BLOSUM50 matrix
Gap penalty: linear gap penalty of 8
First, we will take a closer look at the Needleman-Wunsch algorithm. We will align these two simple sequences. Because we introduced the scoring scheme as a log-odds ratio, the scores are additive and better alignments will have higher scores. For simplicity, we will use a linear gap penalty.

21 Creation of an alignment path matrix
Idea: build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences.
- Construct a matrix F indexed by i and j (one index for each sequence).
- F(i,j) is the score of the best alignment between the initial segment x_{1...i} of x up to x_i and the initial segment y_{1...j} of y up to y_j.
- Build F(i,j) recursively, beginning with F(0,0) = 0.
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE

22 Creation of an alignment path matrix
[Figure: alignment path matrix with H E A G A W G H E E along the top and P A W H E A E down the side.]
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE

23 Creation of an alignment path matrix
\[
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) - d \\
F(i,j-1) - d
\end{cases}
\]
[Figure: cell F(i,j) is reached from F(i-1,j-1) with score s(x_i, y_j), or from F(i-1,j) or F(i,j-1) with penalty -d.]
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE

24 Creation of an alignment path matrix
If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known, we can calculate F(i,j). There are three possibilities:
- x_i and y_j are aligned: F(i,j) = F(i-1,j-1) + s(x_i, y_j)
- x_i is aligned to a gap: F(i,j) = F(i-1,j) - d
- y_j is aligned to a gap: F(i,j) = F(i,j-1) - d
The best score up to (i,j) is the largest of the three options.

25 Creation of an alignment path matrix
[Figure: alignment path matrix with H E A G A W G H E E along the top and P A W H E A E down the side; the left column is initialised to -8, -16, -24, -32, -40, -48, -56.]
Boundary conditions: F(i,0) = -i d and F(0,j) = -j d
To fill the top row and the left column we need some boundary conditions. Top row: j = 0, so F(i,j-1) and F(i-1,j-1) do not exist. Since the F(i,0) values represent gaps, we can define F(i,0) = -i d. Left column: i = 0, so F(0,j) = -j d. When filling the matrix we will keep a pointer in each cell back to the cell from which its F(i,j) was derived.

26 Stepwise calculation of score values
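[Figure: alignment path matrix with the left column initialised to -8, -16, -24, -32, -40, -48, -56 and the first four cells filled in.]
Filling the alignment path matrix step by step, using the recurrence F(i,j) = max{ F(i-1,j-1) + s(x_i, y_j), F(i-1,j) - d, F(i,j-1) - d } with d = 8 and the BLOSUM50 scores s(H,P) = -2, s(E,P) = -1, s(H,A) = -2, s(E,A) = -1:
\[
F(1,1) = \max
\begin{cases}
F(0,0) + s(x_1,y_1) = 0 - 2 = -2 \\
F(0,1) - d = -8 - 8 = -16 \\
F(1,0) - d = -8 - 8 = -16
\end{cases}
= -2
\]
\[
F(2,1) = \max
\begin{cases}
F(1,0) + s(x_2,y_1) = -8 - 1 = -9 \\
F(1,1) - d = -2 - 8 = -10 \\
F(2,0) - d = -16 - 8 = -24
\end{cases}
= -9
\]
\[
F(1,2) = \max
\begin{cases}
F(0,1) + s(x_1,y_2) = -8 - 2 = -10 \\
F(0,2) - d = -16 - 8 = -24 \\
F(1,1) - d = -2 - 8 = -10
\end{cases}
= -10
\]
\[
F(2,2) = \max
\begin{cases}
F(1,1) + s(x_2,y_2) = -2 - 1 = -3 \\
F(1,2) - d = -10 - 8 = -18 \\
F(2,1) - d = -9 - 8 = -17
\end{cases}
= -3
\]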

27 Backtracking
[Figure: completed alignment path matrix with H E A G A W G H E E along the top and P A W H E A E down the side; the optimal path is highlighted and ends with the value 1 in the bottom right cell.]
The alignment path matrix is now filled completely. The value of the final cell of the matrix, F(10,7) at the bottom right corner, is by definition the best score for the global alignment of our two sequences. To find the alignment itself we must find the path of choices that led to this final value. The procedure to do this is called backtracking.
- Build the alignment in reverse, starting from the final cell and following the pointers that we stored when building the matrix.
- At each step add a pair of symbols to the front end of the alignment.
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE
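The three steps (boundary conditions, stepwise filling, backtracking) can be put together in a short Python sketch. It is an illustration under stated assumptions rather than the GCG Gap implementation: the BLOSUM50 subset `B50` is hard-coded for the example residues only, the function name `needleman_wunsch` is made up here, and instead of storing pointers the traceback re-derives each step from the scores, preferring the diagonal when two options tie.

```python
# Subset of BLOSUM50 for the example residues only (assumed transcription).
B50 = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PW": -4, "WW": 15,
}

def s(a, b):
    return B50["".join(sorted(a + b))]

def needleman_wunsch(x, y, d=8):
    n, m = len(x), len(y)
    # F[i][j] is the best score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d                      # boundary: F(i,0) = -i d
    for j in range(1, m + 1):
        F[0][j] = -j * d                      # boundary: F(0,j) = -j d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,    # x_i aligned to a gap
                          F[i][j - 1] - d)    # y_j aligned to a gap
    # Backtracking from the final cell; the alignment is built in reverse.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))

F, ax, ay = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(F[10][7])
print(ax)
print(ay)
```

With a different tie-breaking rule the traceback may return another alignment with the same optimal score, so the diagonal-first preference is a design choice of this sketch.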

28 Smith and Waterman (local alignment)
With the Smith-Waterman algorithm we can look for the best alignment between subsequences of sequence x and sequence y. This arises, for example, when we suspect two sequences to share a common domain or when we compare extended stretches of genomic DNA. It is also a sensitive method for detecting similarity in highly diverged sequences. There are two differences from the Needleman and Wunsch algorithm:
1. An extra possibility of 0 is added to the equation. Taking the value 0 corresponds to starting a new alignment. As a consequence, the top row and the left column are filled with 0.
\[
F(i,j) = \max
\begin{cases}
0 \\
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) - d \\
F(i,j-1) - d
\end{cases}
\]
2. An alignment can now end anywhere in the matrix. So we look for the highest value over the whole matrix and start the backtracking from there. The traceback ends when a cell with value 0 is reached, which corresponds to the start of the alignment.
Example:
Sequence 1: H E A G A W G H E E
Sequence 2: P A W H E A E
Scoring parameters: log-odds ratios
Gap penalty: linear gap penalty of 8

29 Smith Waterman alignment
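[Figure: Smith-Waterman matrix with H E A G A W G H E E along the top and P A W H E A E down the side; the optimal path runs through the cells scoring 5, 20, 12, 22 and 28.]
Our example sequences aligned with the Smith-Waterman algorithm. The optimal local alignment is shown below.
Optimal local alignment:
AWGHE
AW-HE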
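The local variant needs only small changes to the global recurrence, as the following Python sketch shows. The same caveats apply as for any illustration: the BLOSUM50 subset `B50` is hard-coded for the example residues, the function name `smith_waterman` is an assumption, and the traceback re-derives each step from the scores instead of following stored pointers.

```python
# Subset of BLOSUM50 for the example residues only (assumed transcription).
B50 = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PW": -4, "WW": 15,
}

def s(a, b):
    return B50["".join(sorted(a + b))]

def smith_waterman(x, y, d=8):
    n, m = len(x), len(y)
    # Top row and left column stay 0: an alignment may start anywhere.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:                # remember the highest cell
                best, bi, bj = F[i][j], i, j
    # Backtrack from the highest cell until a cell with value 0 is reached.
    ax, ay, i, j = [], [], bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

score, ax, ay = smith_waterman("HEAGAWGHEE", "PAWHEAE")
print(score)
print(ax)
print(ay)
```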

30 Extended Smith & Waterman
To get multiple local alignments:
- delete the regions around the best path
- repeat the backtracking

31 Extended Smith & Waterman
[Figure: Smith-Waterman matrix with H E A G A W G H E E along the top and P A W H E A E down the side.]
Our example sequences aligned with the Smith-Waterman algorithm.

32 Extended Smith & Waterman
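[Figure: Smith-Waterman matrix with H E A G A W G H E E along the top and P A W H E A E down the side; the second best path runs through the cells scoring 10, 16 and 21.]
Our example sequences aligned with the Smith-Waterman algorithm. The second best local alignment is shown below.
Second best local alignment:
HEA
HEA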

33 Further Extensions of Dynamic Programming
- Overlap matches
- Alignment with affine gap scores
The dynamic programming algorithms can be extended to deal with overlap matches, e.g. when comparing genomic DNA fragments to one another. We can also include affine gap penalties. Basically these are variations on the same theme. Anyone who wants to know more can dive into the literature.

34 Pairwise Sequence Alignment
Pairwise sequence comparison
- global / local alignments
- parameters: scoring systems, insertions / deletions
Methods of pairwise sequence alignment
- dotplot
- window-based methods
- dynamic programming
- algorithm complexity

35 End.of.pa.irwise..sequence
| | | | | align.ment.cours.e

36 Methods of Pairwise Comparison
Multiple Alignment. Progressive Alignment, step 1: Methods of Pairwise Comparison
Programs perform global alignments:
- Needleman & Wunsch (Pileup, Tree, Clustal)
- Word Size Method (Clustal)
- X. Huang (MAlign) (modified N-W)

37 Construction of a Guide Tree
Multiple Alignment. Progressive Alignment, step 2: Construction of a Guide Tree
[Figure: similarity matrix for sequences 1 to 5.]
Similarity matrix: displays the scores of all sequence pairs. The similarity matrix is transformed into a distance matrix.

38 Construction of a Guide Tree
Multiple Alignment. Progressive Alignment, step 2: Construction of a Guide Tree
[Figure: distance matrix for sequences 1 to 5 and the guide tree derived from it.]
The guide tree is built with the Neighbour-Joining method or UPGMA (unweighted pair group method using arithmetic averages).

39 Multiple Alignment
Multiple Alignment. Progressive Alignment, step 3: Multiple Alignment
[Figure: guide tree for sequences 1 to 5, with the order in which the sequences are aligned marked on its branches.]

40 Columns - once aligned - are never changed
Multiple Alignment. Progressive Alignment, step 3: columns, once aligned, are never changed
Existing pairwise alignment:
GTCCG-CAGG
TT-CGCC-GG
Sequence to be added:
TTACTTCCAGG
Resulting alignment:
GTCCG--CAGG
TT-CGC-C-GG
TTACTTCCAGG

41 Columns - once aligned - are never changed
Multiple Alignment. Progressive Alignment, step 3: columns, once aligned, are never changed, and new gaps are inserted.
Existing pairwise alignment:
GTCCG-CAGG
TT-CGCC-GG
Sequence to be added:
TTACTTCCAGG
Resulting alignment:
GTCCG--CAGG
TT-CGC-C-GG
TTACTTCCAGG

42 Columns - once aligned - are never changed
Multiple Alignment. Progressive Alignment, step 3: columns, once aligned, are never changed
First group of aligned sequences:
GTCCG--CAGG
TT-CGC-C-GG
TTACTTCCAGG
Second group of aligned sequences:
ATCT--CAAT
CTGTCCCTAG
Resulting alignment:
GTCCG--CAGG
TT-CGC-C-GG
TTACTTCCAGG
ATC-T--CAAT
CTG-TCCCTAG
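The rule that aligned columns are never changed means that a new gap is inserted as a whole column into every sequence of an existing group. A minimal Python sketch of that single operation follows; the helper name `insert_gap_column` and the 0-based column index are assumptions, and the sketch does not decide where the gap should go (that is the job of the pairwise alignment step).

```python
def insert_gap_column(group, col):
    """Insert a gap before position `col` in every sequence of the group."""
    return [seq[:col] + "-" + seq[col:] for seq in group]

# First group: the pairwise alignment gains a gap column before the
# third sequence TTACTTCCAGG is added.
pair = ["GTCCG-CAGG", "TT-CGCC-GG"]
widened_pair = insert_gap_column(pair, 6)
print(widened_pair)

# Second group: a gap column is inserted so that it fits the first group.
second = ["ATCT--CAAT", "CTGTCCCTAG"]
widened_second = insert_gap_column(second, 3)
print(widened_second)
```

Because the whole column is inserted at once, the residues that were already aligned within each group stay in the same column relative to one another.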

43 Sub-sequence alignments

44 A K-means like clustering problem

45 Clustering resulting model

46 Clustering predictions

47 Assignments
- Describe a pairwise alignment with a different gap penalization.
- Provide an example and perform a multiple global alignment. Describe the recipe.
- Provide an example and perform a multiple alignment of subsequences. Describe the recipe.
- Algorithm order (polynomial, exponential, NP)

48 Algorithmic Complexity
How does an algorithm's performance in CPU time and required memory storage scale with the size of the problem?
Needleman & Wunsch:
- Stores (n+1) x (m+1) numbers
- Each number costs a constant number of calculations to compute (three sums and a max)
- The algorithm takes O(nm) memory and O(nm) time
- Since n and m are usually comparable: O(n^2)
It is useful to know how an algorithm's performance in CPU time and required memory storage will scale with the size of the problem. This way of writing the cost is called big-O notation: the Needleman and Wunsch algorithm is of order nm. With biological sequences and standard computers, O(n^2) algorithms are feasible but a little slow, while O(n^3) algorithms are only feasible for very short sequences.
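The counts above can be made concrete with a small Python sketch that tabulates the storage and the number of cell updates for a given problem size. The function name `nw_cost` is an assumption, and the numbers are operation counts, not measured run times.

```python
def nw_cost(n, m):
    """Return (numbers stored, interior cells computed) for lengths n and m."""
    cells_stored = (n + 1) * (m + 1)   # the full matrix, boundaries included
    cell_updates = n * m               # each update: three sums and one max
    return cells_stored, cell_updates

for n in (100, 1000, 10000):
    stored, updates = nw_cost(n, n)
    print(f"n = m = {n:>5}: {stored:>11} numbers stored, {updates:>11} cell updates")
```

Doubling both sequence lengths multiplies the number of cell updates by four, which is what quadratic scaling means in practice.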

