Cédric Notredame (19/10/2015) Using Dynamic Programming To Align Sequences Cédric Notredame.

Our Scope
-Coding a Global and a Local Algorithm
-Understanding the DP concept
-Aligning with Affine gap penalties
-Sophisticated variants…
-Saving memory

Outline
-Coding Dynamic Programming with Non-affine Penalties
-Adding affine penalties
-Turning a global algorithm into a local algorithm
-Using a Divide and Conquer Strategy
-Tailoring DP to your needs:
  -The Repeated Matches Algorithm
  -Double Dynamic Programming

Global Alignments Without Affine Gap Penalties: Dynamic Programming

Dynamic Programming: How To Align Two Sequences With a Gap Penalty, a Substitution Matrix and Not Too Much Time

A bit of History…
-DP invented in the 50s by Bellman
-Programming = Tabulation
-Re-invented in 1970 by Needleman and Wunsch
-It took 10 years to find out…

The Foolish Assumption
The score of each column of the alignment is independent of the rest of the alignment.
It is possible to model the relationship between two sequences with:
-A substitution matrix
-A simple gap penalty

The Principle of DP
If you optimally extend an optimal alignment of two sub-sequences, the result remains an optimal alignment.
[Figure: an optimal alignment (????) extended by one column, which is either a Deletion (X over -), an Alignment (X over X) or an Insertion (- over X)]

Finding the score of i,j
-Sequence 1: [1-i]
-Sequence 2: [1-j]
-The optimal alignment of [1-i] vs [1-j] can finish in three different manners: X over -, X over X, or - over X

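Finding the score of i,j
Three ways to build the alignment of 1…i vs 1…j:
-the alignment of 1…i-1 vs 1…j, followed by the column i over -
-the alignment of 1…i-1 vs 1…j-1, followed by the column i over j
-the alignment of 1…i vs 1…j-1, followed by the column - over j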

Finding the score of i,j
In order to compute the score of 1…i vs 1…j, all we need are the scores of:
-1…i-1 vs 1…j-1
-1…i vs 1…j-1
-1…i-1 vs 1…j

Formalizing the algorithm
\[
F(i,j) = \operatorname{best}
\begin{cases}
F(i,j-1) + Gep \\
F(i-1,j-1) + Mat[i,j] \\
F(i-1,j) + Gep
\end{cases}
\]

Arranging Everything in a Table
[Figure: DP table for FAST vs FAT with a leading "-" row and column; the cell for 1…I vs 1…J is computed from its three neighbours 1…I-1 vs 1…J-1, 1…I vs 1…J-1 and 1…I-1 vs 1…J]

Taking Care of the Limits
In a Dynamic Programming strategy, the most delicate part is to take care of the limits:
-what happens when you start
-what happens when you finish
The DP strategy relies on the idea that ALL the cells in your table have the same environment… This is NOT true of ALL the cells!

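Taking Care of the Limits
Match=2 MisMatch=-1 Gap=-1
[Figure: the border cells of the FAST vs FAT table hold the score of aligning a prefix against gaps only: 0 in the corner, then -1 (F vs -), -2 (FA vs --), -3 (FAT vs ---) along one border and down to -4 (FAST vs ----) along the other]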
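
The recurrence and the border rule can be sketched together as follows. This is a minimal illustration in Python rather than the Perl used elsewhere in these slides; the function name `nw_fill` and its parameter names are assumptions made for the example, and the scoring values are the Match=2, MisMatch=-1, Gap=-1 of the slide.

```python
def nw_fill(seq_i, seq_j, match=2, mismatch=-1, gep=-1):
    """Fill the global-alignment score table F (seq_i on rows, seq_j on columns)."""
    n, m = len(seq_i), len(seq_j)
    F = [[0] * (m + 1) for _ in range(n + 1)]

    # Limits: a prefix aligned against nothing is a run of gaps.
    for i in range(1, n + 1):
        F[i][0] = i * gep
    for j in range(1, m + 1):
        F[0][j] = j * gep

    # Every other cell only needs its three neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_i[i - 1] == seq_j[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # column i over j
                          F[i - 1][j] + gep,     # column i over -
                          F[i][j - 1] + gep)     # column - over j
    return F


table = nw_fill("FAST", "FAT")
for row in table:
    print(row)
```

The last cell of the table, `table[-1][-1]`, holds the score of the best global alignment.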

Filling Up The Matrix

[Figure: the FAST vs FAT table with its cells filled in]

Delivering the alignment: Trace-back
Score of 1…3 vs 1…4 → Optimal alignment score
Tracing back from that cell delivers the alignment:
FAST
FA-T

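Trace-back: possible implementation
    # Walk back from the last cell to (0,0), following the move stored in $tb.
    # The alignment is collected from its last column to its first.
    while (!($i==0 && $j==0))
      {
        if ($tb[$i][$j]==$sub)       #SUBSTITUTION
          {
            $alnI[$aln_len]=$seqI[--$i];
            $alnJ[$aln_len]=$seqJ[--$j];
          }
        elsif ($tb[$i][$j]==$del)    #DELETION (gap in sequence I)
          {
            $alnI[$aln_len]='-';
            $alnJ[$aln_len]=$seqJ[--$j];
          }
        elsif ($tb[$i][$j]==$ins)    #INSERTION (gap in sequence J)
          {
            $alnI[$aln_len]=$seqI[--$i];
            $alnJ[$aln_len]='-';
          }
        $aln_len++;
      }
    # Put the columns back in left-to-right order
    @alnI=reverse @alnI;
    @alnJ=reverse @alnJ;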
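
For readers who want to run the whole procedure end to end, here is a self-contained sketch of fill plus trace-back. It is written in Python rather than Perl, and the names `global_align`, `SUB`, `DEL` and `INS` are assumptions chosen for the example. The border cells are given trace-back codes too, so that the walk always reaches (0,0).

```python
SUB, DEL, INS = 1, 2, 3  # hypothetical trace-back codes


def global_align(seq_i, seq_j, match=2, mismatch=-1, gep=-1):
    """Return (score, aligned seq_i, aligned seq_j) for a linear gap penalty."""
    n, m = len(seq_i), len(seq_j)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    tb = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # first column: residues of I over gaps
        F[i][0], tb[i][0] = i * gep, INS
    for j in range(1, m + 1):           # first row: gaps over residues of J
        F[0][j], tb[0][j] = j * gep, DEL

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_i[i - 1] == seq_j[j - 1] else mismatch
            candidates = [(F[i - 1][j - 1] + s, SUB),
                          (F[i][j - 1] + gep, DEL),
                          (F[i - 1][j] + gep, INS)]
            F[i][j], tb[i][j] = max(candidates, key=lambda c: c[0])

    # Trace-back from the last cell; columns are collected in reverse order.
    aln_i, aln_j = [], []
    i, j = n, m
    while not (i == 0 and j == 0):
        if tb[i][j] == SUB:
            i, j = i - 1, j - 1
            aln_i.append(seq_i[i])
            aln_j.append(seq_j[j])
        elif tb[i][j] == DEL:
            j -= 1
            aln_i.append("-")
            aln_j.append(seq_j[j])
        else:
            i -= 1
            aln_i.append(seq_i[i])
            aln_j.append("-")
    return F[n][m], "".join(reversed(aln_i)), "".join(reversed(aln_j))


print(global_align("FAST", "FAT"))
```

When several moves tie, this sketch keeps the first one in the order substitution, deletion, insertion; any other tie-breaking rule would give an equally good alignment.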

Local Alignments Without Affine Gap Penalties: Smith and Waterman

Smith and Waterman: Getting rid of the pieces of junk between the interesting bits

The Smith and Waterman Algorithm
\[
F(i,j) = \operatorname{best}
\begin{cases}
F(i-1,j) + Gep \\
F(i-1,j-1) + Mat[i,j] \\
F(i,j-1) + Gep
\end{cases}
\]

The Smith and Waterman Algorithm
\[
F(i,j) = \operatorname{best}
\begin{cases}
F(i-1,j) + Gep \\
F(i-1,j-1) + Mat[i,j] \\
F(i,j-1) + Gep \\
0
\end{cases}
\]

The Smith and Waterman Algorithm
0 → Ignore the rest of the matrix → Terminate a local alignment

Filling Up a SW Matrix

Filling up a SW matrix: borders
[Figure: SW table for ANICECAT vs CATANDOG with the border cells set to 0]
Easy: Local alignments NEVER start/end with a gap…

Filling up a SW matrix
[Figure: the filled SW table for ANICECAT vs CATANDOG]
Best local score → Beginning of the trace-back

Turning NW into SW
    for ($i=1; $i<=$len0; $i++)
      {
        for ($j=1; $j<=$len1; $j++)
          {
            if ($res0[0][$i-1] eq $res1[0][$j-1]){$s=2;}
            else {$s=-1;}

            # Scores are read from the same table they are written to
            $sub=$smat[$i-1][$j-1]+$s;
            $del=$smat[$i  ][$j-1]+$gep;
            $ins=$smat[$i-1][$j  ]+$gep;

            # Extra state of SW: a cell never drops below zero
            if    ($sub>$del && $sub>$ins && $sub>0) {$smat[$i][$j]=$sub; $tb[$i][$j]=$subcode;}
            elsif ($del>$ins && $del>0)              {$smat[$i][$j]=$del; $tb[$i][$j]=$delcode;}
            elsif ($ins>0)                           {$smat[$i][$j]=$ins; $tb[$i][$j]=$inscode;}
            else                                     {$smat[$i][$j]=$zero;$tb[$i][$j]=$stopcode;}

            # Prepare trace-back: remember the best scoring cell
            if ($smat[$i][$j]> $best_score)
              {
                $best_score=$smat[$i][$j];
                $best_i=$i;
                $best_j=$j;
              }
          }
      }
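
A runnable counterpart of the local algorithm, including the trace-back from the best cell, might look like the sketch below. It is in Python rather than Perl, and `local_align` and the `STOP`/`SUB`/`DEL`/`INS` codes are names assumed for the example. The trace-back starts at the best scoring cell and ends as soon as it meets a stop cell.

```python
STOP, SUB, DEL, INS = 0, 1, 2, 3  # hypothetical trace-back codes


def local_align(seq_i, seq_j, match=2, mismatch=-1, gep=-1):
    """Return (best local score, aligned piece of seq_i, aligned piece of seq_j)."""
    n, m = len(seq_i), len(seq_j)
    F = [[0] * (m + 1) for _ in range(n + 1)]       # borders stay at 0
    tb = [[STOP] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_i[i - 1] == seq_j[j - 1] else mismatch
            # (0, STOP) comes first, so a cell that cannot beat 0 is a stop cell.
            F[i][j], tb[i][j] = max((0, STOP),
                                    (F[i - 1][j - 1] + s, SUB),
                                    (F[i][j - 1] + gep, DEL),
                                    (F[i - 1][j] + gep, INS),
                                    key=lambda c: c[0])
            if F[i][j] > best:
                best, best_i, best_j = F[i][j], i, j

    aln_i, aln_j = [], []
    i, j = best_i, best_j
    while tb[i][j] != STOP:
        if tb[i][j] == SUB:
            i, j = i - 1, j - 1
            aln_i.append(seq_i[i])
            aln_j.append(seq_j[j])
        elif tb[i][j] == DEL:
            j -= 1
            aln_i.append("-")
            aln_j.append(seq_j[j])
        else:
            i -= 1
            aln_i.append(seq_i[i])
            aln_j.append("-")
    return best, "".join(reversed(aln_i)), "".join(reversed(aln_j))


print(local_align("CATANDOG", "ANICECAT"))
```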

A few things to remember
SW only works if the substitution matrix has been normalized to give a negative score to a random alignment.
Chance should not pay when it comes to local alignments!
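
One way to check this condition is to compute the expected score of a column that pairs two residues drawn independently at random. The sketch below is in Python; the function name `expected_score`, the uniform DNA frequencies and the simple match/mismatch scheme are all assumptions made for illustration, not values taken from the slides.

```python
def expected_score(freqs, match=2, mismatch=-1):
    """Expected score of one column pairing two independent random residues."""
    return sum(p * q * (match if a == b else mismatch)
               for a, p in freqs.items()
               for b, q in freqs.items())


uniform_dna = {base: 0.25 for base in "ACGT"}

# Negative: random columns cost something, so SW behaves as intended.
print(expected_score(uniform_dna, match=2, mismatch=-1))

# Positive: random columns pay, so local alignments would grow by chance alone.
print(expected_score(uniform_dna, match=2, mismatch=0))
```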

More than One match…
-SW delivers only the best scoring match
-If you need more than one match:
  -SIM (Huang and Miller)
  -or Waterman and Eggert (Durbin, p91)

Waterman and Eggert
-Iterative algorithm:
  -1-identify the best match
  -2-redo SW with used pairs forbidden
  -3-finish when the last interesting local alignment has been extracted
-Delivers a collection of non-overlapping local alignments
-Avoids trivial variations of the optimal.
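
The iteration can be sketched as below, in Python. This is a simplified illustration of the idea rather than the published algorithm: it reruns a complete SW pass each time (the original recomputes only the affected part of the table), cells used by earlier alignments are simply held at 0, and the names `sw_pass`, `render`, `waterman_eggert` and the `min_score` threshold for "interesting" are assumptions.

```python
def sw_pass(seq_i, seq_j, forbidden, match=2, mismatch=-1, gep=-1):
    """One SW pass in which every cell listed in `forbidden` is held at 0."""
    n, m = len(seq_i), len(seq_j)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    prev = [[None] * (m + 1) for _ in range(n + 1)]  # None marks a stop cell
    best, end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if (i, j) in forbidden:
                continue
            s = match if seq_i[i - 1] == seq_j[j - 1] else mismatch
            F[i][j], prev[i][j] = max((0, None),
                                      (F[i - 1][j - 1] + s, (i - 1, j - 1)),
                                      (F[i][j - 1] + gep, (i, j - 1)),
                                      (F[i - 1][j] + gep, (i - 1, j)),
                                      key=lambda cand: cand[0])
            if F[i][j] > best:
                best, end = F[i][j], (i, j)
    path, cell = [], end
    while prev[cell[0]][cell[1]] is not None:
        path.append(cell)
        cell = prev[cell[0]][cell[1]]
    return best, cell, path[::-1]   # score, stop cell, cells of the alignment


def render(seq_i, seq_j, start, path):
    """Turn a path of cells into two gapped strings."""
    aln_i, aln_j = [], []
    pi, pj = start
    for i, j in path:
        aln_i.append(seq_i[i - 1] if i > pi else "-")
        aln_j.append(seq_j[j - 1] if j > pj else "-")
        pi, pj = i, j
    return "".join(aln_i), "".join(aln_j)


def waterman_eggert(seq_i, seq_j, min_score=3, **scoring):
    """Collect non-overlapping local alignments scoring at least min_score."""
    forbidden, results = set(), []
    while True:
        score, start, path = sw_pass(seq_i, seq_j, forbidden, **scoring)
        if score < min_score or not path:
            break
        results.append((score,) + render(seq_i, seq_j, start, path))
        forbidden.update(path)      # used pairs are forbidden from now on
    return results


print(waterman_eggert("CATANDOG", "ANICECAT"))
```

The `min_score` cut-off should be positive; it plays the role of "interesting" in step 3.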

Adding Affine Gap Penalties: The Gotoh Algorithm

The Gotoh Formulation: Forcing a bit of Biology into your alignment

Why Affine Gap Penalties are Biologically better
[Figure: plot of gap Cost against gap length L for an Affine Gap Penalty: a GOP to open the gap, then a GEP for each extension]
Parsimony: Evolution takes the simplest path (So We Think…)
Cost=gop+L*gep
Or
Cost=gop+(L-1)*gep
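
The two conventions differ only in whether the first gap position also pays an extension. A small Python sketch makes the consequence concrete; the function name `affine_gap_cost`, its flag and the example values gop=-10, gep=-1 are assumptions for illustration.

```python
def affine_gap_cost(length, gop=-10, gep=-1, count_first_extension=True):
    """Cost of one gap of `length` positions under an affine penalty.

    count_first_extension=True  -> gop + L * gep
    count_first_extension=False -> gop + (L - 1) * gep
    """
    if length <= 0:
        return 0
    extensions = length if count_first_extension else length - 1
    return gop + extensions * gep


one_long_gap = affine_gap_cost(3)
three_short_gaps = 3 * affine_gap_cost(1)
print(one_long_gap, three_short_gaps)
```

With these values a single gap of length 3 is much cheaper than three separate gaps of length 1, which is the parsimony argument of the slide.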

But Harder To compute…
More Than 3 Ways to extend an Alignment
[Figure: an alignment (????) extended by a Deletion (X over -), an Alignment (X over X) or an Insertion (- over X); each gap column is now either an Opening or an Extension]

More Questions Need to be asked
For instance, what is the cost of an insertion?
[Figure: the cost depends on the column before it: GOP if that column is an aligned pair (??X over ??X), GEP if it is already a gap (??- over ??X)]

Solution: Maintain 3 Tables
-Ix: Table that contains the score of every optimal alignment 1…i vs 1…j that finishes with an Insertion in sequence X.
-Iy: Table that contains the score of every optimal alignment 1…i vs 1…j that finishes with an Insertion in sequence Y.
-M: Table that contains the score of every optimal alignment 1…i vs 1…j that finishes with an alignment between sequence X and Y.

The Algorithm
\[
M(i,j) = \operatorname{best}
\begin{cases}
M(i-1,j-1) + Mat(i,j) \\
\mathrm{Ix}(i-1,j-1) + Mat(i,j) \\
\mathrm{Iy}(i-1,j-1) + Mat(i,j)
\end{cases}
\]
\[
\mathrm{Ix}(i,j) = \operatorname{best}
\begin{cases}
M(i-1,j) + gop \\
\mathrm{Ix}(i-1,j) + gep
\end{cases}
\]
\[
\mathrm{Iy}(i,j) = \operatorname{best}
\begin{cases}
M(i,j-1) + gop \\
\mathrm{Iy}(i,j-1) + gep
\end{cases}
\]
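
The three recurrences translate almost literally into code. The Python sketch below computes the score only (no trace-back); the name `gotoh_score`, the example penalties gop=-3 and gep=-1, and the use of minus infinity for states that cannot exist on the borders are assumptions of this illustration. With these recurrences a gap of length L costs gop+(L-1)*gep.

```python
NEG_INF = float("-inf")


def gotoh_score(seq_x, seq_y, match=2, mismatch=-1, gop=-3, gep=-1):
    """Best global score with affine gaps, using the tables M, Ix and Iy."""
    n, m = len(seq_x), len(seq_y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0   # the empty alignment

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                s = match if seq_x[i - 1] == seq_y[j - 1] else mismatch
                M[i][j] = max(M[i - 1][j - 1],
                              Ix[i - 1][j - 1],
                              Iy[i - 1][j - 1]) + s
            if i > 0:   # residue i of X facing a gap: open it or extend it
                Ix[i][j] = max(M[i - 1][j] + gop, Ix[i - 1][j] + gep)
            if j > 0:   # residue j of Y facing a gap: open it or extend it
                Iy[i][j] = max(M[i][j - 1] + gop, Iy[i][j - 1] + gep)

    return max(M[n][m], Ix[n][m], Iy[n][m])


print(gotoh_score("FAST", "FAT"))
```

Running the same loops over the border cells fills them automatically: for instance Ix(i,0) comes out as gop+(i-1)*gep.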

Trace-back?
[Figure: the three tables M, Ix and Iy]
Start from the BEST of M(i,j), Ix(i,j), Iy(i,j)

Trace-back?
[Figure: the three tables M, Ix and Iy]
Navigate from one table to the next, knowing that a gap always finishes with an aligned column…

Going Further ?
With the affine gap penalties, we have increased the number of possibilities when building our alignment.
Computer scientists talk of states and represent this as a Finite State Automaton (FSAs are cousins of HMMs).

Going Further ?

Going Further ?
In theory, there is no limit on the number of states one may consider when doing such a computation.

Going Further ? Imagine a pairwise alignment algorithm where the gap penalty depends on the length of the gap. Can you simplify it realistically so that it can be efficiently implemented?

[Figure with labels Lx and Ly]

A Divide and Conquer Strategy: The Myers and Miller Strategy

The Myers and Miller Strategy: Remember Not To Run Out of Memory

A Score in Linear Space
You never need more than the previous row to compute the optimal score.

A Score in Linear Space
Keep only two rows: R1 (the previous row) and R2 (the current row).
For i
    For j, R2[j]=best(R2[j-1]+gep, R1[j-1]+mat, R1[j]+gep)
    For j, R1[j]=R2[j]
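
The two-row scheme above can be sketched in a few lines of Python. The name `nw_score_linear_space` and the scoring defaults (match=2, mismatch=-1, gep=-1) are assumptions made for the example; only the score is returned, since no trace-back information is kept.

```python
def nw_score_linear_space(seq_i, seq_j, match=2, mismatch=-1, gep=-1):
    """Optimal global score using two rows instead of the full table."""
    r1 = [j * gep for j in range(len(seq_j) + 1)]       # row 0 (the limits)
    for i in range(1, len(seq_i) + 1):
        r2 = [i * gep]                                  # first column of row i
        for j in range(1, len(seq_j) + 1):
            s = match if seq_i[i - 1] == seq_j[j - 1] else mismatch
            r2.append(max(r2[j - 1] + gep, r1[j - 1] + s, r1[j] + gep))
        r1 = r2                                         # row i becomes the previous row
    return r1[-1]


print(nw_score_linear_space("FAST", "FAT"))
```

Memory use grows with the length of one sequence only, while the running time stays proportional to the product of the two lengths.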

A Score in Linear Space

A Score in Linear Space
You never need more than the previous row to compute the optimal score.
You only need the matrix for the Trace-Back. Or do you?

An Alignment in Linear Space
Forward algorithm: F(i,j)=Optimal score of 0…i vs 0…j
Backward algorithm: B(i,j)=Optimal score of M…i vs N…j
B(i,j)+F(i,j)=Optimal score of the alignment that passes through pair i,j
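
A small Python sketch can make the F+B identity tangible. It uses full tables for clarity, so it is not itself linear in space. Two choices here are assumptions of the example: the names `fill` and `through_scores`, and the convention that cell (i,j) is the point reached after consuming i residues of the first sequence and j of the second, so that the backward score covers only what remains and no column is counted twice. The backward table is obtained by running the forward fill on the reversed sequences.

```python
def fill(seq_i, seq_j, match=2, mismatch=-1, gep=-1):
    """Forward table: F[i][j] = best score of seq_i[:i] vs seq_j[:j]."""
    n, m = len(seq_i), len(seq_j)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gep
    for j in range(1, m + 1):
        F[0][j] = j * gep
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_i[i - 1] == seq_j[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gep, F[i][j - 1] + gep)
    return F


def through_scores(seq_i, seq_j, **scoring):
    """T[i][j] = score of the best alignment forced to pass through cell (i, j)."""
    n, m = len(seq_i), len(seq_j)
    F = fill(seq_i, seq_j, **scoring)
    R = fill(seq_i[::-1], seq_j[::-1], **scoring)   # backward pass on reversed input
    # B(i,j), the best score of seq_i[i:] vs seq_j[j:], is R[n-i][m-j].
    return [[F[i][j] + R[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]


for row in through_scores("FAST", "FAT"):
    print(row)
```

Every row contains the optimal score at least once, because the optimal alignment has to cross each row somewhere; cells with a lower value mark alignments that are forced away from the optimal path.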

An Alignment in Linear Space
[Figure: the tables filled by the Forward algorithm and the Backward algorithm, with the cell of optimal B(i,j)+F(i,j) marked]

An Alignment in Linear Space
[Figure: the Forward algorithm and the Backward algorithm applied to the two halves of the table]
Recursive divide and conquer strategy: Myers and Miller (Durbin p35)
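
The recursion can be sketched as follows in Python. Note the simplification: this is the Hirschberg-style divide and conquer for a linear gap penalty, which is the idea that Myers and Miller extended to affine gaps; it is not their full algorithm. The names `last_row` and `align_linear_space` and the scoring defaults are assumptions of the example.

```python
def pair_score(a, b, match, mismatch):
    return match if a == b else mismatch


def last_row(seq_i, seq_j, match, mismatch, gep):
    """Last row of the global score table, computed with two rows only."""
    prev = [j * gep for j in range(len(seq_j) + 1)]
    for i in range(1, len(seq_i) + 1):
        cur = [i * gep]
        for j in range(1, len(seq_j) + 1):
            s = pair_score(seq_i[i - 1], seq_j[j - 1], match, mismatch)
            cur.append(max(prev[j - 1] + s, prev[j] + gep, cur[j - 1] + gep))
        prev = cur
    return prev


def align_linear_space(seq_i, seq_j, match=2, mismatch=-1, gep=-1):
    """Optimal global alignment without ever storing the full table."""
    n, m = len(seq_i), len(seq_j)
    if n == 0:
        return "-" * m, seq_j
    if m == 0:
        return seq_i, "-" * n
    if n == 1:
        # Pair the single residue with its best partner, unless two gaps are cheaper.
        k = max(range(m), key=lambda t: pair_score(seq_i, seq_j[t], match, mismatch))
        if pair_score(seq_i, seq_j[k], match, mismatch) >= 2 * gep:
            return "-" * k + seq_i + "-" * (m - k - 1), seq_j
        return seq_i + "-" * m, "-" + seq_j

    mid = n // 2
    fwd = last_row(seq_i[:mid], seq_j, match, mismatch, gep)              # forward half
    bwd = last_row(seq_i[mid:][::-1], seq_j[::-1], match, mismatch, gep)  # backward half
    cut = max(range(m + 1), key=lambda j: fwd[j] + bwd[m - j])            # best F + B

    left = align_linear_space(seq_i[:mid], seq_j[:cut], match, mismatch, gep)
    right = align_linear_space(seq_i[mid:], seq_j[cut:], match, mismatch, gep)
    return left[0] + right[0], left[1] + right[1]


print(align_linear_space("FAST", "FAT"))
```

Each level of the recursion recomputes scores for its sub-problems, so the method trades extra computation (roughly twice the work of a single fill) for memory that grows only linearly with sequence length.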

An Alignment in Linear Space

A Forward-only Strategy (Durbin, p35)
Forward Algorithm:
-Keep Row M in memory
-Keep track of which cell in Row M led to the optimal score
-Divide on this cell

An interesting application: finding sub-optimal alignments
[Figure: tables filled by the Forward algorithm and the Backward algorithm]
Sum over the Forward/Backward tables and identify the score of the best alignment going through cell i,j.

Application: Non-local models. Double Dynamic Programming

Outline
The main limitation of DP: Context independent measure

Double Dynamic Programming
High Level Smith and Waterman Dynamic Programming:
\[
Score = \max
\begin{cases}
S(i-1,j-1) + \text{RMSd score} \\
S(i,j-1) + gp
\end{cases}
\]
RMSd Score: Rigid Body Superposition where i and j are forced together

Double Dynamic Programming

Application: Repeats. The Durbin Algorithm

In The End: Wrapping it Up

Dynamic Programming
-Needleman and Wunsch: Delivers the best scoring global alignment
-Smith and Waterman: NW with an extra state 0
-Affine Gap Penalties: Making DP more realistic

Dynamic Programming
-Linear space: Using Divide and Conquer Strategies not to run out of memory
-Double Dynamic Programming, repeat extraction: DP can easily be adapted to a special need