1-month Practical Course Genome Analysis Lecture 4: Pair-wise alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The.

1-month Practical Course Genome Analysis Lecture 4: Pair-wise alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The Netherlands www.ibivu.cs.vu.nl heringa@cs.vu.nl C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E

Pair-wise alignment Combinatorial explosion - 1 gap in 1 sequence: n+1 possibilities - 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc. 2n (2n)! 2 2n = ~ n (n!) 2   n 2 sequences of 300 a.a.: ~10 88 alignments 2 sequences of 1000 a.a.: ~10 600 alignments! T D W V T A L K T D W L - - I K

Technique to overcome the combinatorial explosion: Dynamic Programming Alignment is simulated as Markov process, all sequence positions are seen as independent Chances of sequence events are independent –Therefore, probabilities per aligned position need to be multiplied –Amino acid matrices contain so-called log-odds values (log 10 of the probabilities), so probabilities can be summed

To perform statistical analyses on messages or sequences, we need a reference model. The model: each letter in a sequence is selected from a defined alphabet in an independent and identically distributed (i.i.d.) manner. This choice of model system will allow us to compute the statistical significance of certain characteristics of a sequence, its subsequences, or an alignment. Given a probability distribution, P i, for the letters in a i.i.d. message, the probability of seeing a particular sequence of letters i, j, k,... n is simply P i P j P k ···P n. As an alternative to multiplication of the probabilities, we could sum their logarithms and exponentiate the result. The probability of the same sequence of letters can be computed by exponentiating log P i + log P j + log P k + ··· + log P n. In practice, when aligning sequences we only add log-odds values (residue exchange matrix) but we do not exponentiate the final score. To say the same more statistically…

Sequence alignment History of the Dynamic Programming (DP) algorithm 1970 Needleman-Wunsch global pair-wise alignment Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3):443-53. 1981 Smith-Waterman local pair-wise alignment Smith, TF, Waterman, MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.

Pairwise sequence alignment Global dynamic programming MDAGSTVILCFVG MDAASTILCGSMDAASTILCGS Amino Acid Exchange Matrix Gap penalties (open,extension) Search matrix MDAGSTVILCFVG- MDAAST-ILC--GS Evolution

Three kinds of Gap Penalty schemes: Linear: gp(k)=ak Affine: gp(k)=b+ak (most commonly used) Concave (general): e.g. gp(k)=log(k) k is the gap length Dynamic programming b k gp(k)

Global dynamic programming i-1 i j-1 j H(i,j) = Max H(i-1,j-1) + S(i,j) H(i-1,j) - g H(i,j-1) - g diagonal vertical horizontal Value from residue exchange matrix This is a recursive formula

Dynamic programming steps MDAASTILCGSMDAASTILCGS Search matrix MDAGSTVILCFVG- MDAAST-ILC--GS MDAGSTVILCFVG 1.Forward step: Fill in search matrix using appropriate scoring system (exchange matrix and gap penalties) and place back-pointers 2.Get alignment score from ‘last’ cell: 1.Lower right cell (global and semi-global DP) 2.Highest scoring cell anywhere in matrix (local alignment – see later slides) 3.Perform trace-back (using back- pointers placed during forward step) and get alignment

Global dynamic programming PAM250, Gap =6 (linear) SPEARE 0-6-12-18-24-30-36 S-62-4-10-16-22-28 H-12 -42-3-9-14-20-24 A-18-10-30-7-13-18 K-24-16-9-32-4-12 E-30-22-15-5-3-26-6 -30-24-18-12-60 SPEARE S210100 H01 21 A1102-20 K00 30 E0 40 4 These values are copied from the PAM250 matrix (see earlier slide) The extra bottom row and rightmost column give the penalties that would need to be applied due to end gaps Higgs & Attwood, p. 124 S-HAKE SPEARE

Global dynamic programming i-1 i j-1 j H(i,j) = Max H(i-1,j-1) + S(i,j) H(i-1,j) - g H(i,j-1) - g diagonal vertical horizontal Problem with simple DP approach: Can only do linear gap penalties Not suitable for affine and concave penalties

Global dynamic programming with concave gap penalties Ngila is a global, pairwise alignment program that uses logarithmic and affine gap costs, i.e. C(g) = a+b*g+c*ln(g). These gap costs are more biologically realistic than the more popular (and efficient) affine gap cost model. Ngila version 1.2.1 includes two new, evolutionary alignment models. These models allow you to find the maximum alignment of two sequences based on biological, evolutionary parameters--no more guessing at biological costs. Additional changes are noted on the website.  Website & Manual: http://scit.us/projects/ngila/http://scit.us/projects/ngila/  Reference: Cartwright RA (2007) Ngila: global pairwise alignments with logarithmic and affine gap costs. Bioinformatics. 23(11):1427-1428

Global dynamic programming Affine and logarithmic gap penalties i-1 j-1 S i,j = s i,j + Max Max{S 0<x<i-1, j-1 - Pi - (i-x-1)Px} S i-1,j-1 Max{S i-1, 0<y<j-1 - Pi - (j-y-1)Px} Gap opening penalty Gap extension penalty

Global dynamic programming Gap o =10, Gap e =2 DWVTALK 0-12-14-16-18-20-22-24 T-128-9-6-5-9-11-14 D 09223-5-3-34 W-16-1325115490-21 V-18-10-4372119 15-16 L-20-14-223463137261 K-22-12-9173353395014 -34-2917392750 DWVTALK T83811998 D12168848 W12523265 V621288106 L46 96145 K85687513 These values are copied from the PAM250 matrix (see earlier slide), after being made non- negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix) The extra bottom row and rightmost column give the final global alignment scores

Easy DP recipe for using affine gap penalties M[i,j] is optimal alignment (highest scoring alignment until [i,j]) Check –preceding row until j-2: apply appropriate gap penalties –preceding row until i-2: apply appropriate gap penalties –and cell[i-1, j-1]: apply score for cell[i-1, j-1] i-1 j-1

DP is a two-step process Forward step: calculate scores Trace back: start at highest score and reconstruct the path leading to the highest score –These two steps lead to the highest scoring alignment (the optimal alignment) –This is guaranteed when you use DP!

Global dynamic programming

Semi-global pairwise alignment Global alignment: all gaps are penalised Semi-global alignment: N- and C-terminal gaps (end-gaps) are not penalised MSTGAVLIY--TS----- ---GGILLFHRTSGTSNS End-gaps

Semi-global dynamic programming - two examples with different gap penalties - Global score is 65 – 10 – 1*2 –10 – 2*2 These values are copied from the PAM250 matrix (see earlier slide), after being made non- negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix)

Semi-global pairwise alignment Applications of semi-global: – Finding a gene in genome – Placing marker onto a chromosome – One sequence much longer than the other Danger: if gap penalties high -- really bad alignments for divergent sequences N- or C-terminal amino acids are often small and polar, giving rise to semi-global alignments with tips of sequences matched in cases of low sequence similarity

Local dynamic programming (Smith & Waterman, 1981) LCFVMLAGSTVIVGTR EDASTILCGSEDASTILCGS Amino Acid Exchange Matrix Gap penalties (open, extension) Search matrix Negative numbers AGSTVIVG A-STILCG

Local dynamic programming (Smith & Waterman, 1981) i-1 j-1 S i,j = Max S i,j + Max{S 0<x<i-1,j-1 - Pi - (i-x-1)Px} S i,j + S i-1,j-1 S i,j + Max {S i-1,0<y<j-1 - Pi - (j-y-1)Px} 0 Gap opening penalty Gap extension penalty

Local dynamic programming

Dot plots Way of representing (visualising) sequence similarity without doing dynamic programming (DP) Make same matrix, but locally represent sequence similarity by averaging using a window

Comparing two sequences We want to be able to choose the best alignment between two sequences. A simple method of visualising similarities between two sequences is to use dot plots. The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis. Comparison is done through scoring diagonals of preset length (gapless fragments)

Comparing two sequences using a dot plot Central cell in window takes the (average) score over all window cells Here you see a diagonal (window) of 7 residue pairs, i.e. the window has length 7.

Dot plots can be filtered by window approaches (to calculate running averages) and applying a threshold They can identify insertions, deletions, inversions

Global or Local Pairwise alignment Local Global A B A B A B BA AB Local Global A B A C C A BC ABC A B A C C

Globin fold  protein myoglobin PDB: 1MBN Helices are labelled ‘A’ (blue) to ‘H’ (red). D helix can be missing in some globins: what happens with the alignment?

Pyruvate kinase Phosphotransferase  barrel regulatory domain  barrel catalytic substrate binding domain  nucleotide binding domain

What does this mean for alignments? Alignments need to be able to skip secondary structural elements to complete domains (i.e. putting gaps opposite these motifs in the shorter sequence). Depending on gap penalties chosen, the algorithm might have difficulty with making such long gaps (for example when using high affine gap penalties), resulting in incorrect alignment.

There are three kinds of pairwise alignments Global alignment – align all residues in both sequences; all gaps are penalised Semi-global alignment – align all residues in both sequences; end gaps are not penalised (zero end gap penalties) Local alignment – align part of each sequence; end gaps are not applicable

Easy global DP recipe for using affine gap penalties (after Gotoh) M[i,j] is optimal alignment (highest scoring alignment until [i, j]) At each cell [i, j] in search matrix, check Max coming from: any cell in preceding row until j-2: add score for cell[i, j] minus appropriate gap penalties; any cell in preceding column until i-2: add score for cell[i, j] minus appropriate gap penalties; or cell[i-1, j-1]: add score for cell[i, j] Select highest scoring cell in bottom row and rightmost column and do trace-back i-1 j-1 Penalty = Pi + gap_length*Pe S i,j = s i,j + Max Max{S 0<x<i-1, j-1 - Pi - (i-x-1)Px} S i-1,j-1 Max{S i-1, 0<y<j-1 - Pi - (j-y-1)Px}

Let’s do an example: global alignment Gotoh’s DP algorithm with affine gap penalties (PAM250, Pi=10, Pe=2) Row and column ‘0’ are filled with 0, -12, -14, -16, … if global alignment is used (for N-terminal end-gaps); also extra row and column at the end to calculate the score including C-terminal end-gap penalties. DWVTALK T0-503110 D4-7-200-40 W-717-6-5-6-2-3 V-2-64002-2 L-4-221 6-3 K0 -20-35 DWVTALK 0-12-14-16-18-20-22-24 T-120-17-14-13-17-19-22 D-14-8-7-14 -13-42 W-16-219-13-19-18 V -2013-3-16 L-20-22-1814-14 K-22-20-21-12 -24-42-41-18-16-14-120 PAM250 Cell (D2, T4) can alternatively come from two cells (same score): ‘high-road’ or ‘low-road’

Let’s do another example: semi-global alignment Gotoh’s DP algorithm with affine gap penalties (PAM250, Pi=10, Pe=2) Starting row and column ‘0’, and extra column at right or extra row at bottom is not necessary when using semi global alignment (zero end- gaps). Rest works as under global alignment. DWVTALK T0-503110 D4-7-200-40 W-717-6-5-6-2-3 V-2-64002-2 L-4-221 6-3 K0 -20-35 DWVTALK T0-503 D4-7 W 21-13 V-2-13259 L K PAM250

Easy local DP recipe for using affine gap penalties (after Gotoh) M[i,j] is optimal alignment (highest scoring alignment until [i, j]) At each cell [i, j] in search matrix, check Max coming from: any cell in preceding row until j-2: add score for cell[i, j] minus appropriate gap penalties; any cell in preceding column until i-2: add score for cell[i, j] minus appropriate gap penalties; or cell[i-1, j-1]: add score for cell[i, j] Select highest scoring cell anywhere in matrix and do trace-back until zero-valued cell or start of sequence(s) i-1 j-1 Penalty = Pi + gap_length*Pe S i,j = Max S i,j + Max{S 0<x<i-1,j-1 - Pi - (i-x-1)Px} S i,j + S i-1,j-1 S i,j + Max {S i-1,0<y<j-1 - Pi - (j-y-1)Px} 0

Let’s do yet another example: local alignment Gotoh’s DP algorithm with affine gap penalties (PAM250, Pi=10, Pe=2) Extra start/end columns/rows not necessary (no end-gaps). Each negative scoring cell is set to zero. Highest scoring cell may be found anywhere in search matrix after calculating it. Trace highest scoring cell back to first cell with zero value (or the beginning of one or both sequences) DWVTALK T0-503110 D4-7-200-40 W-717-6-5-6-2-3 V-2-64002-2 L-4-221 6-3 K0 -20-35 DWVTALK T0003 D4000 W02100 V00259 L0011 K00 PAM250

What to align, nucleotide or amino acid sequences? If ORF then align at protein level –(i) Many mutations within DNA are synonymous, leading to overestimation of sequence divergence if compared at the DNA level. –(ii) Evolutionary relationships can be more finely expressed using a 20×20 amino acid exchange table than using nucleotide exchanges. –(iii) DNA sequences contain non-coding regions which should be avoided in homology searches. Still an issue when translating into (six) protein sequences through a codon table. –(iv) Searching at protein level: frameshifts can occur, leading to stretches of incorrect amino acids and possibly elongation of sequences due to missed stop codons. But frameshifts normally result in stretches of highly unlikely amino acids: can be used as a signal to trace.

A note on reliability  All these matrices are designed using standard evolutionary models.  It is important to understand that evolution is not the same for all proteins, not even for the same regions of proteins.  No single matrix performs best on all sequences. Some are better for sequences with few gaps, and others are better for sequences with fewer identical amino acids.  Therefore, when aligning sequences, applying a general model to all cases is not ideal. Rather, re-adjustment can be used to make the general model better fit the given data.

Alignment summary  If ORF exists, then align at protein level.  Amino acid substitution matrices reflect the log-odds ratio between the evolutionary and random model and therefore help in determining homology via the alignment score.  The evolutionary and random models depend on generalized data used to derive them. This is not an ideal solution.  Apart from the PAM and BLOSUM series, a large number of further matrices have been developed.  Matrices have been made based on DNA, protein structure, information content, etc.  For local alignment, BLOSUM62 is often superior; for distant (global) alignments, BLOSUM50, GONNET, or (still) PAM250 work well.  Remember that gap penalties are always a problem; unlike the matrices themselves, there is no formal way to calculate their values -- you can follow recommended settings, but these are based on trial and error and not on a formal framework.

Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 816-831,1995) Correctly matched residues are determined using structural alignments (comparing protein 3D structures); Error rates become dramatic at lower sequence identity levels – at 25% identical residues 75% of residues is aligned correctly, at 15% sequence identity only 30% is aligned correctly!

1-month Practical Course Genome Analysis Lecture 4: Pair-wise alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The.

Similar presentations

Presentation on theme: "1-month Practical Course Genome Analysis Lecture 4: Pair-wise alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1-month Practical Course Genome Analysis Lecture 4: Pair-wise alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The.

Similar presentations

Presentation on theme: "1-month Practical Course Genome Analysis Lecture 4: Pair-wise alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The."— Presentation transcript:

Similar presentations

About project

Feedback