1-month Practical Course Genome Analysis (Integrative Bioinformatics & Genomics) Lecture 3: Pair-wise alignment Centre for Integrative Bioinformatics VU.

Slides:



Advertisements
Similar presentations
Global Sequence Alignment by Dynamic Programming.
Advertisements

Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Definitions Optimal alignment - one that exhibits the most correspondences. It is the alignment with the highest score. May or may not be biologically.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
 If Score(i, j) denotes best score to aligning A[1 : i] and B[1 : j] Score(i-1, j) + galign A[i] with GAP Score(i, j-1) + galign B[j] with GAP Score(i,
Lecture 6, Thursday April 17, 2003
Introduction to Bioinformatics Burkhard Morgenstern Institute of Microbiology and Genetics Department of Bioinformatics Goldschmidtstr. 1 Göttingen, March.
“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky ( )) “Nothing in bioinformatics makes sense except in.
©CMBI 2005 Sequence Alignment In phylogeny one wants to line up residues that came from a common ancestor. For information transfer one wants to line up.
Heuristic alignment algorithms and cost matrices
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2005.
Developing Pairwise Sequence Alignment Algorithms
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
CENTRFORINTEGRATIVE BIOINFORMATICSVU E [1] Sequence Analysis Sequence Analysis Lecture 3 C E N T R F O R I N T E G R A T I V E B I O I N F O.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Sequence Alignment Oct 9, 2002 Joon Lee Genomics & Computational Biology.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
CENTRFORINTEGRATIVE BIOINFORMATICSVU E [1] Sequence Analysis Alignments 2: Local alignment Sequence Analysis
1-month Practical Course Genome Analysis Lecture 4: Pair-wise alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The.
Algorithms Dr. Nancy Warter-Perez June 19, May 20, 2003 Developing Pairwise Sequence Alignment Algorithms2 Outline Programming workshop 2 solutions.
Developing Sequence Alignment Algorithms in C++ Dr. Nancy Warter-Perez May 21, 2002.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Protein Sequence Comparison Patrice Koehl
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Pairwise alignment Computational Genomics and Proteomics.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
LCS and Extensions to Global and Local Alignment Dr. Nancy Warter-Perez June 26, 2003.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Sequence comparison: Local alignment
Developing Pairwise Sequence Alignment Algorithms
Bioiformatics I Fall Dynamic programming algorithm: pairwise comparisons.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU.
Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Bioinformatics in Biosophy
Pairwise & Multiple sequence alignments
Pair-wise Sequence Alignment (II) Introduction to bioinformatics 2008 Lecture 6 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Chapter 3 Computational Molecular Biology Michael Smith
Arun Goja MITCON BIOPHARMA
Pair-wise Sequence Alignment Introduction to bioinformatics 2007 Lecture 5 C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
The ideal approach is simultaneous alignment and tree estimation.
Sequence comparison: Dynamic programming
Sequence comparison: Local alignment
Biology 162 Computational Genetics Todd Vision Fall Aug 2004
Introduction to bioinformatics 2007
Introduction to bioinformatics 2007
Pairwise sequence Alignment.
Intro to Alignment Algorithms: Global and Local
Introduction to bioinformatics 2007
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
Pairwise Alignment Global & local alignment
Sequence Alignment Algorithms Morten Nielsen BioSys, DTU
1-month Practical Course
Introduction to bioinformatics Lecture 5 Pair-wise sequence alignment
Presentation transcript:

1-month Practical Course Genome Analysis (Integrative Bioinformatics & Genomics) Lecture 3: Pair-wise alignment Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam The Netherlands C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E

Pair-wise alignment Combinatorial explosion - 1 gap in 1 sequence: n+1 possibilities - 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc. 2n (2n)! 2 2n = ~ n (n!) 2   n 2 sequences of 300 a.a.: ~10 88 alignments 2 sequences of 1000 a.a.: ~ alignments! T D W V T A L K T D W L - - I K

Technique to overcome the combinatorial explosion: Dynamic Programming Alignment is simulated as Markov process, all sequence positions are seen as independent Chances of sequence events are independent –Therefore, probabilities per aligned position need to be multiplied –Amino acid matrices contain so-called log-odds values (log 10 of the probabilities), so probabilities can be summed

To perform statistical analyses on messages or sequences, we need a reference model. The model: each letter in a sequence is selected from a defined alphabet in an independent and identically distributed (i.i.d.) manner. This choice of model system will allow us to compute the statistical significance of certain characteristics of a sequence, its subsequences, or an alignment. Given a probability distribution, P i, for the letters in a i.i.d. message, the probability of seeing a particular sequence of letters i, j, k,... n is simply P i P j P k ···P n. As an alternative to multiplication of the probabilities, we could sum their logarithms and exponentiate the result. The probability of the same sequence of letters can be computed by exponentiating log P i + log P j + log P k + ··· + log P n. In practice, when aligning sequences we only add log-odds values (residue exchange matrix) but we do not exponentiate the final score. To say the same more statistically…

Sequence alignment History of the Dynamic Programming (DP) algorithm 1970 Needleman-Wunsch global pair-wise alignment Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3): Smith-Waterman local pair-wise alignment Smith, TF, Waterman, MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147,

Pairwise sequence alignment Global dynamic programming MDAGSTVILCFVG MDAASTILCGSMDAASTILCGS Amino Acid Exchange Matrix Gap penalties (open,extension) Search matrix MDAGSTVILCFVG- MDAAST-ILC--GS Evolution

How to determine similarity Frequent evolutionary events at the DNA level: 1. Substitution 2. Insertion, deletion 3. Duplication 4. Inversion We will restrict ourselves to these events

Three kinds of Gap Penalty schemes: Linear: gp(k)=ak Affine: gp(k)=b+ak (most commonly used) Concave (general): e.g. gp(k)=log(k) k is the gap length Dynamic programming b k gp(k)

– Substitution (or match/mismatch) DNA proteins – Gap penalty Linear: gp(k)=  k Affine: gp(k)=  +  k Concave, e.g.: gp(k)=log(k) The score of an alignment is the sum of the scores over all alignment columns Dynamic programming Scoring alignments, General alignment score: S a,b =  k gp(k)

Dynamic programming Scoring alignments 101 Amino Acid Exchange Matrix Gap penalties (open, extension) 20  20 Score: s(T,T) + s(D,D) + s(W,W) + s(V,L) –P o -P x + + s(L,I) + s(K,K) T D W V T A L K T D W L - - I K

Constant vs. affine gap scoring Seq1G T A - -G - T - A Seq2- - A T G- A T G - Const-2 –2 1 –2 –2 (SUM = -7)-2 – –2 (SUM = -7) Affine – (SUM = -7)-3 – –3 (SUM = -11)‏ -2 extensionopening Gap Scoring -3Affine -2Constant …and +1 for match

Pairwise sequence alignment Global dynamic programming MDAGSTVILCFVG MDAASTILCGSMDAASTILCGS Amino Acid Exchange Matrix Gap penalties (open,extension) Search matrix MDAGSTVILCFVG- MDAAST-ILC--GS Evolution

Dynamic programming i j The cell [i, j] contains the alignment score of the best scoring alignment of subsequence 1..i and 1..j, that is, the subsequences up to [i, j] Cell [i, j] does not ‘know’ what that best scoring alignment is (it is one out of many possibilities) Extend alignment from cell [i, j]

The DP algorithm Y[1…j - 1] Y[j] X[1…i - 1] X[i] Y[1…j-1]Y[j] X[1…i] - Y[1…j] - X[1…i-1] X[i] M[i, j-1] - gM[i-1, j] - g M[i-1, j-1] + m - s M[i,j] = Goal: find the maximal scoring alignment Dynamic programming  Solve smaller subproblem(s)‏  Iteratively extend the solution Scores: m match, -s mismatch, -g for insertion/deletion 3 ways to extend the alignment:

Global dynamic programming i-1 i j-1 j H(i,j) = Max H(i-1,j-1) + S(i,j) H(i-1,j) - g H(i,j-1) - g diagonal vertical horizontal Value from residue exchange matrix This is a recursive formula

The algorithm – final equation M[i, j] = M[i, j- 1 ] – g M[i- 1, j] – g M[i- 1, j- 1 ] + score(X[i],Y[j])‏ jj-1 i-1 i max * * -g X 1 …X i-1 X i Y 1 …Y j-1 Y j X 1 …X i - Y 1 …Y j-1 Y j X 1 …X i-1 X i Y 1 …Y j - Corresponds to:

Example: global alignment of two sequences Align two DNA sequences: –GAGTGA –GAGGCGA (note the length difference)‏ Parameters of the algorithm: –Match: score(A,A) = 1 –Mismatch: score(A,T) = -1 –Gap: g = -2 M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max

The algorithm. Step 1: init Create the matrix Initiation –0 at [0,0] –Apply the equation… M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max jj ii A G C G G A G 0 - AGTGAG-

The algorithm. Step 1: init Initiation of the matrix: –0 at pos [0,0] –Fill in the first row using the “  ” rule –Fill in the first column using “  ” M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max j i

The algorithm. Step 2: fill in Continue filling in of the matrix, remembering from which cell the result comes (arrows)‏ M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max j i

The algorithm. Step 2: fill in We are done… Where’s the result? M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max j i The lowest- rightmost cell

The algorithm. Step 3: traceback Start at the last cell of the matrix Go against the direction of arrows Sometimes the value may be obtained from more than one cell (which ones?)‏ j i M[i, j] = M[i, j- 1 ] – 2 M[i- 1, j] – 2 M[i- 1, j- 1 ] ± 1 max

The algorithm. Step 3: traceback Extract the alignments j i GAGT-GA GAGGCGA GA-GTGA GAGGCGA a b a)‏ b)‏ GAG-TGA GAGGCGA c)‏

Global dynamic programming i-1 i j-1 j H(i,j) = Max H(i-1,j-1) + S(i,j) H(i-1,j) - g H(i,j-1) - g diagonal vertical horizontal Problem with simple DP approach: Can only do linear gap penalties Not suitable for affine or concave penalties, but algorithm can be extended to deal with affine penalties

Global dynamic programming using affine penalties i-2 i-1 i j-2 j-1 j If you came from here, gap was opened so apply gap-open penalty If you came from here, gap was already open, so apply gap-extension penalty Looking back from cell (i, j) we can adapt the algorithm for affine gap penalties by looking back to four more cells (magenta)

Affine gaps M[i, j] = I x [i-1, j-1] I y [i-1, j-1] M[i-1, j-1] max score(X[i], Y[j]) + I x [i, j] = M[i, j-1] + g o I x [i, j-1] + g e max I y [i, j] = M[i-1, j] + g o I y [i-1, j] + g x max Penalties: g o - gap opening (e.g. -8)‏ g e - gap extension (e.g. -1)‏ X 1 …X i - Y 1 …Y j-1 Y j X 1 … - - Y 1 …Y j-1 Y j X1… XiY1… YjX1… XiY1… home: think of boundary values M[*,0], I[*,0] etc.

Global dynamic programming all types of gap penalty i-1 j-1 S i,j = s i,j + Max Max{S 0<x<i-1, j-1 - Gap(i-x-1)} S i-1,j-1 Max{S i-1, 0<y<j-1 - Gap(j-y-1)} The complexity of this DP algorithm is increased from O(n 2 ) to O(n 3 ) The gap length is known exactly and so any gap penalty regime can be used

Global dynamic programming if affine penalties are used i-1 j-1 S i,j = s i,j + Max Max{S 0<x<i-1, j-1 -G o -(i-x-1)*G e } S i-1,j-1 Max{S i-1, 0<y<j-1 -G o -(j-y-1)*G e }

Global dynamic programming Linear, Affine or Concave gap penalties i-1 j-1 S i,j = s i,j + Max Max{S 0<x<i-1, j-1 - Pi - (i-x-1)Px} S i-1,j-1 Max{S i-1, 0<y<j-1 - Pi - (j-y-1)Px} Gap opening penalty Gap extension penalty All penalty schemes are possible because the exact length of the gap is known

DP is a two-step process Forward step: calculate scores Trace back: start at lower-right corner cell and reconstruct the path leading to the highest score –These two steps lead to the highest scoring global alignment (the optimal alignment) –This is guaranteed when you use DP!

Time and memory complexity of DP The time complexity is O(n 2 ): for aligning two sequences of n residues, you need to perform n 2 algorithmic steps (square search matrix has n 2 cells that need to be filled) The memory (space) complexity is also O(n 2 ): for aligning two sequences of n residues, you need a square search matrix of n by n containing n 2 cells