Expected accuracy sequence alignment Usman Roshan.

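Optimal pairwise alignment
Sum of pairs (SP) optimization: find the alignment of two sequences that maximizes the similarity score given an arbitrary cost matrix. The Needleman-Wunsch algorithm finds the optimal alignment in O(mn) time and space, where m and n are the lengths of x and y.
Recursion:
$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i, y_j) \\ M(i-1,j) + g \\ M(i,j-1) + g \end{cases}$$
where $M(i,j)$ is the score of the optimal alignment of $x_{1..i}$ and $y_{1..j}$, $s$ is a substitution scoring matrix, and $g$ is the gap penalty (a negative value added to the score).
Traceback: start at $M(m,n)$ and repeatedly step back to the cell that produced the maximum, until $M(0,0)$ is reached.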
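
The recursion and traceback can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the slides: the function name needleman_wunsch, the +1/-1 scoring function, and the gap penalty of -1 are assumptions chosen for the example.

```python
def needleman_wunsch(x, y, score, g):
    """Global alignment with a linear gap penalty g (g is negative)."""
    m, n = len(x), len(y)
    # M[i][j] = score of the optimal alignment of x[:i] and y[:j]
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + g
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # x_i paired with y_j
                M[i - 1][j] + g,                              # x_i paired with a gap
                M[i][j - 1] + g,                              # y_j paired with a gap
            )
    # Traceback: walk from (m, n) to (0, 0) through cells that produced the maximum.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + g:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return M[m][n], "".join(reversed(ax)), "".join(reversed(ay))


def simple_score(a, b):
    """Hypothetical substitution scores: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


print(needleman_wunsch("ACGT", "AGT", simple_score, -1))  # (2, 'ACGT', 'A-GT')
```

When several cells tie for the maximum, this sketch prefers the diagonal move; other tie-breaking rules give equally optimal alignments.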

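Affine gap penalties
The affine gap model allows for long insertions in distant proteins by charging a lower penalty for extending a gap than for opening one. We define g as the gap open penalty (first gap position) and e as the gap extension penalty (each additional gap position).
Trivial cost matrix: match=+1, mismatch=0, gapopen=-2, gapextension=-0.1
Alignment 1:
ACACCCT
AC-CT-T
Score = 0 (four matches, one mismatch, two gap openings: 4 - 2 - 2)
Alignment 2:
ACACCCC
AC--CTT
Score = 0.9 (three matches, two mismatches, one gap opening and one extension: 3 - 2 - 0.1)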
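
The two scores can be checked with a short Python sketch that scores an existing gapped alignment under the trivial cost matrix. The function name affine_score and its keyword arguments are illustrative assumptions, not part of the slides.

```python
def affine_score(row_a, row_b, match=1.0, mismatch=0.0, gap_open=-2.0, gap_ext=-0.1):
    """Score a gapped pairwise alignment; '-' marks a gap position."""
    assert len(row_a) == len(row_b)
    total = 0.0
    prev_gap_row = None  # which row held the gap in the previous column, if any
    for ca, cb in zip(row_a, row_b):
        if ca == "-" or cb == "-":
            gap_row = "a" if ca == "-" else "b"
            # First position of a gap run pays gap_open, later positions pay gap_ext.
            total += gap_ext if prev_gap_row == gap_row else gap_open
            prev_gap_row = gap_row
        else:
            total += match if ca == cb else mismatch
            prev_gap_row = None
    return total


print(round(affine_score("ACACCCT", "AC-CT-T"), 6))
print(round(affine_score("ACACCCC", "AC--CTT"), 6))
```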

Affine penalty recursion
$M(i,j)$ is the best score over alignments of $x_{1..i}$ and $y_{1..j}$ ending with a match/mismatch. $E(i,j)$ is the best score over alignments of $x_{1..i}$ and $y_{1..j}$ such that $y_j$ is paired with a gap. $F(i,j)$ is defined similarly, with $x_i$ paired with a gap. The recursion takes O(mn) time, where m and n are the lengths of x and y respectively.
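
A common way to write this three-matrix recursion (Gotoh's form) is sketched below in Python. It is one plausible reading of the definitions above, under the assumption that a gap in one sequence cannot be followed directly by a gap in the other (no E to F or F to E moves); the function names are illustrative.

```python
NEG_INF = float("-inf")


def affine_align_score(x, y, score, g, e):
    """Optimal global alignment score with gap open g and gap extension e."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with x_i paired with y_j
    E = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with y_j paired with a gap
    F = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with x_i paired with a gap
    M[0][0] = 0.0
    for j in range(1, n + 1):
        E[0][j] = g + (j - 1) * e
    for i in range(1, m + 1):
        F[i][0] = g + (i - 1) * e
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = score(x[i - 1], y[j - 1]) + max(
                M[i - 1][j - 1], E[i - 1][j - 1], F[i - 1][j - 1]
            )
            E[i][j] = max(M[i][j - 1] + g, E[i][j - 1] + e)  # open or extend
            F[i][j] = max(M[i - 1][j] + g, F[i - 1][j] + e)  # open or extend
    return max(M[m][n], E[m][n], F[m][n])


def trivial_score(a, b):
    """Trivial cost matrix: match = +1, mismatch = 0."""
    return 1.0 if a == b else 0.0


print(round(affine_align_score("ACACCCC", "ACCTT", trivial_score, -2.0, -0.1), 6))
```

Each cell is filled in constant time from its three neighbours, which gives the O(mn) running time.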

Expected accuracy alignment
The dynamic programming formulation finds the optimal alignment defined by a scoring matrix and gap penalties. That alignment is not necessarily the most “accurate” or biologically informative one. We now look at a different formulation of alignment that computes the most accurate alignment instead of the optimal-scoring one.

Posterior probability of $x_i$ aligned to $y_j$
Let A be the set of all alignments of sequences x and y, and define P(a|x,y) to be the probability that alignment a (of x and y) is the true alignment a*. We define the posterior probability of the i-th residue of x ($x_i$) aligning to the j-th residue of y ($y_j$) in the true alignment (a*) of x and y as
$$P(x_i \sim y_j \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$$
(Do et al., Genome Research, 2005)

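Expected accuracy of alignment
We can define the expected accuracy of an alignment a as
$$\mathrm{E}[\mathrm{accuracy}(a)] = \frac{1}{\min(|x|,|y|)} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \mid x, y)$$
The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm, using the posterior probabilities as match scores and no gap penalties (Do et al., Genome Research, 2005).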
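
As a rough sketch, the maximum expected accuracy alignment can be computed in Python from a posterior matrix as follows. The function name and the example matrix are assumptions made for illustration; the probabilities are invented, not taken from real sequences.

```python
def max_expected_accuracy(P):
    """P[i][j] is the posterior probability that x_(i+1) aligns to y_(j+1).

    Returns the expected accuracy and the aligned (i, j) pairs (1-based).
    """
    m, n = len(P), len(P[0])
    A = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = max(
                A[i - 1][j - 1] + P[i - 1][j - 1],  # pair x_i with y_j
                A[i - 1][j],                        # gap, no penalty
                A[i][j - 1],                        # gap, no penalty
            )
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j - 1] + P[i - 1][j - 1]:
            pairs.append((i, j)); i -= 1; j -= 1
        elif A[i][j] == A[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return A[m][n] / min(m, n), pairs


# Hypothetical posteriors for x = ACCG (rows) and y = ACCCA (columns).
P_example = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
accuracy, pairs = max_expected_accuracy(P_example)
print(round(accuracy, 3), pairs)
```

In this example the third residue of x is paired with the fourth residue of y, because that pairing carries the larger posterior probability (0.7 rather than 0.3).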

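Example for expected accuracy
Assume each residue pair of the true alignment has posterior probability 1 and every other pair has probability 0.
True alignment:
AC_CG
ACCCA
Expected accuracy = (1+1+1+1)/4 = 1
Estimated alignment:
ACC_G
ACCCA
Expected accuracy = (1+1+0+1)/4 = 0.75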
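
The same numbers can be reproduced with a small Python sketch that compares the residue pairs of an estimated alignment with those of the true alignment. The helper names are illustrative, and both "_" and "-" are treated as gap characters.

```python
GAP_CHARS = "-_"


def aligned_pairs(row_x, row_y):
    """Return the set of (i, j) pairs (1-based) aligned in a gapped alignment."""
    pairs = set()
    i = j = 0
    for cx, cy in zip(row_x, row_y):
        if cx not in GAP_CHARS:
            i += 1
        if cy not in GAP_CHARS:
            j += 1
        if cx not in GAP_CHARS and cy not in GAP_CHARS:
            pairs.add((i, j))
    return pairs


def accuracy(estimated, true):
    """Fraction of true residue pairs recovered, normalized by the shorter sequence."""
    len_x = sum(c not in GAP_CHARS for c in true[0])
    len_y = sum(c not in GAP_CHARS for c in true[1])
    shared = aligned_pairs(*estimated) & aligned_pairs(*true)
    return len(shared) / min(len_x, len_y)


true_alignment = ("AC_CG", "ACCCA")
print(accuracy(true_alignment, true_alignment))      # 1.0
print(accuracy(("ACC_G", "ACCCA"), true_alignment))  # 0.75
```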

Estimating posterior probabilities
If the correct posterior probabilities can be computed, then we can compute the correct alignment. It remains to estimate these probabilities from the data:
–PROBCONS (Do et al., Genome Research 2005): estimates probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et al. 1998)
–Probalign (Roshan and Livesay, Bioinformatics 2006): estimates probabilities using partition function dynamic programming matrices

HMM posterior probabilities
Consider the probability of all alignments of sequences X and Y under a given HMM. Let M(i,j) be the sum of the probabilities of all alignments of $X_{1..i}$ and $Y_{1..j}$ that end in a match or mismatch. M(i,j) is computed by a forward recursion over the HMM states, and X(i,j) and Y(i,j) are calculated in the same way. We call these forward probabilities:
–f(i,j) = M(i,j)+X(i,j)+Y(i,j)
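
For reference, one standard form of the forward recursion is the three-state pair HMM of Durbin et al. (1998). The parameter names below are assumptions borrowed from that textbook and do not appear in the slides: $\delta$ is the gap-open probability, $\varepsilon$ the gap-extension probability, $\tau$ the transition probability to the end state, $p_{ab}$ the probability of emitting the aligned pair (a, b), and $q_a$ the probability of emitting a against a gap.

$$M(i,j) = p_{x_i y_j}\left[(1 - 2\delta - \tau)\, M(i-1,j-1) + (1 - \varepsilon - \tau)\big(X(i-1,j-1) + Y(i-1,j-1)\big)\right]$$
$$X(i,j) = q_{x_i}\left[\delta\, M(i-1,j) + \varepsilon\, X(i-1,j)\right]$$
$$Y(i,j) = q_{y_j}\left[\delta\, M(i,j-1) + \varepsilon\, Y(i,j-1)\right]$$

with $M(0,0) = 1$, $X(0,0) = Y(0,0) = 0$, and every term with a negative index equal to 0. Under these assumptions the total probability of the two sequences is $P(x,y) = \tau\,\big(M(m,n) + X(m,n) + Y(m,n)\big)$.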

HMM posterior probabilities
Similarly, we can calculate backward probabilities M’(i,j). Define M’(i,j) as the sum of the probabilities of all alignments of $X_{i..m}$ and $Y_{j..n}$ such that $X_i$ and $Y_j$ are aligned to each other. The indices i and j start from m and n respectively and decrease. X’(i,j) and Y’(i,j) are defined analogously.
–B(i,j)=M’(i,j)+X’(i,j)+Y’(i,j)

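HMM posterior probabilities
The posterior probability of $x_i$ aligned to $y_j$ is given by
$$P(x_i \sim y_j \mid x, y) = \frac{M(i,j)\, M'(i,j)}{P(x,y)}$$
where P(x,y) is the total probability of all alignments of x and y, and the emission of the pair ($x_i$, $y_j$) is counted only once in the product.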

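Partition function posterior probabilities
Standard alignment score:
$$S(a) = \sum_{x_i \sim y_j \in a} s(x_i, y_j) + \text{gap penalties}$$
Probability of alignment (Miyazawa, Prot. Eng. 1995):
$$P(a) \propto \exp\big(S(a)/T\big)$$
where T is a temperature parameter. If we knew the alignment partition function Z, then
$$P(a) = \frac{\exp\big(S(a)/T\big)}{Z}$$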

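Partition function posterior probabilities
Alignment partition function (Miyazawa, Prot. Eng. 1995):
$$Z = \sum_{a \in A} \exp\big(S(a)/T\big)$$
Subsequently, Z can be computed by dynamic programming over prefixes of x and y, in the same way as the optimal alignment score.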

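Partition function posterior probabilities
More generally, with gap open penalty g and gap extension penalty e, the forward partition function matrices are calculated as
$$Z^M_{i,j} = \big(Z^M_{i-1,j-1} + Z^E_{i-1,j-1} + Z^F_{i-1,j-1}\big) \exp\big(s(x_i,y_j)/T\big)$$
$$Z^E_{i,j} = Z^M_{i,j-1} \exp(g/T) + Z^E_{i,j-1} \exp(e/T)$$
$$Z^F_{i,j} = Z^M_{i-1,j} \exp(g/T) + Z^F_{i-1,j} \exp(e/T)$$
$$Z_{i,j} = Z^M_{i,j} + Z^E_{i,j} + Z^F_{i,j}$$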

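Partition function matrices vs. standard affine recursions
The partition function recursions have the same structure as the standard affine recursions: each maximum becomes a sum, and each added score or penalty becomes a multiplied factor exp(score/T).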

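Posterior probability calculation
If we define Z’ as the “backward” partition function matrices, computed in the same way starting from the ends of x and y, then
$$P(x_i \sim y_j) = \frac{Z_{i-1,j-1}\, \exp\big(s(x_i,y_j)/T\big)\, Z'_{i+1,j+1}}{Z}$$
where $Z_{i-1,j-1}$ is the total partition function of all alignments of $x_{1..i-1}$ and $y_{1..j-1}$, $Z'_{i+1,j+1}$ is the total partition function of all alignments of $x_{i+1..m}$ and $y_{j+1..n}$, and Z is the partition function of the full sequences.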
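
A minimal Python sketch of this calculation is given below. It assumes affine gap recursions with match (M), y-gap (E) and x-gap (F) matrices and no direct move between E and F, and it obtains the backward matrices by running the forward recursion on the reversed sequences. The function names and parameter values are illustrative assumptions, not the Probalign source code.

```python
import math


def forward_partition(x, y, score, g, e, T):
    """Forward partition function matrices Zm, Ze, Zf for prefixes of x and y."""
    m, n = len(x), len(y)
    Zm = [[0.0] * (n + 1) for _ in range(m + 1)]
    Ze = [[0.0] * (n + 1) for _ in range(m + 1)]
    Zf = [[0.0] * (n + 1) for _ in range(m + 1)]
    Zm[0][0] = 1.0  # the empty alignment
    open_w, ext_w = math.exp(g / T), math.exp(e / T)
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                prev = Zm[i - 1][j - 1] + Ze[i - 1][j - 1] + Zf[i - 1][j - 1]
                Zm[i][j] = prev * math.exp(score(x[i - 1], y[j - 1]) / T)
            if j > 0:
                Ze[i][j] = Zm[i][j - 1] * open_w + Ze[i][j - 1] * ext_w
            if i > 0:
                Zf[i][j] = Zm[i - 1][j] * open_w + Zf[i - 1][j] * ext_w
    return Zm, Ze, Zf


def posterior_matrix(x, y, score, g, e, T):
    """P[i-1][j-1] = posterior probability that x_i is aligned to y_j."""
    m, n = len(x), len(y)
    Zm, Ze, Zf = forward_partition(x, y, score, g, e, T)
    Z = Zm[m][n] + Ze[m][n] + Zf[m][n]
    # Backward matrices: the forward recursion applied to the reversed sequences.
    Rm, Re, Rf = forward_partition(x[::-1], y[::-1], score, g, e, T)
    P = [[0.0] * n for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            suffix = Rm[m - i][n - j] + Re[m - i][n - j] + Rf[m - i][n - j]
            # Zm[i][j] already equals Z_{i-1,j-1} * exp(s(x_i, y_j) / T).
            P[i - 1][j - 1] = Zm[i][j] * suffix / Z
    return P


def trivial(a, b):
    """Trivial cost matrix: match = +1, mismatch = 0."""
    return 1.0 if a == b else 0.0


for row in posterior_matrix("AC", "A", trivial, -2.0, -0.1, 1.0):
    print([round(p, 4) for p in row])
```

For long sequences the entries of Z grow or shrink exponentially, so a practical implementation would work in log space or rescale the matrices; the sketch omits this for brevity.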

Posterior probabilities using alignment ensembles
By generating an ensemble A(n,x,y) of n alignments of x and y, we can estimate $P(x_i \sim y_j)$ as the fraction of alignments in the ensemble in which $x_i$ is aligned to $y_j$. Note that this assigns equal weight to every alignment in the ensemble.
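
The counting estimate is straightforward to sketch in Python. The function name and the small hand-made ensemble below are illustrative assumptions.

```python
from collections import Counter


def posterior_from_ensemble(alignments):
    """Estimate P(x_i ~ y_j) as the fraction of alignments containing the pair.

    Each alignment is a (row_x, row_y) tuple of equal-length strings, with '-'
    marking gaps. Keys of the result are 1-based (i, j) pairs.
    """
    counts = Counter()
    for row_x, row_y in alignments:
        i = j = 0
        for cx, cy in zip(row_x, row_y):
            if cx != "-":
                i += 1
            if cy != "-":
                j += 1
            if cx != "-" and cy != "-":
                counts[(i, j)] += 1
    n = len(alignments)
    return {pair: c / n for pair, c in counts.items()}


# Hypothetical ensemble of four alignments of ACCG and ACCCA.
ensemble = [("AC-CG", "ACCCA")] * 3 + [("ACC-G", "ACCCA")]
estimate = posterior_from_ensemble(ensemble)
print(estimate[(3, 4)], estimate[(3, 3)])  # 0.75 0.25
```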

Generating ensemble of alignments
We can use stochastic backtracking (Muckstein et al., Bioinformatics, 2002) to generate a given number of optimal and suboptimal alignments. At every step in the traceback we assign a probability to each of the three possible predecessor positions, proportional to its contribution to the partition function. This allows us to “sample” alignments from their partition function probability distribution. Posterior probabilities estimated from such an ensemble turn out to be the same as those calculated using forward and backward partition function matrices.
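
Stochastic backtracking can be illustrated with a simplified Python sketch. To keep it short it assumes a single partition function matrix with a linear gap penalty g rather than affine gaps, so it is a simplification of the scheme described above; the function name, scoring function, and parameter values are assumptions.

```python
import math
import random


def sample_alignment(x, y, score, g, T, rng):
    """Sample one alignment with probability proportional to exp(S(a)/T)."""
    m, n = len(x), len(y)
    gap_w = math.exp(g / T)
    # Z[i][j] = partition function over all alignments of x[:i] and y[:j].
    Z = [[0.0] * (n + 1) for _ in range(m + 1)]
    Z[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                Z[i][j] += Z[i - 1][j - 1] * math.exp(score(x[i - 1], y[j - 1]) / T)
            if i > 0:
                Z[i][j] += Z[i - 1][j] * gap_w
            if j > 0:
                Z[i][j] += Z[i][j - 1] * gap_w
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        # Weight of each predecessor = its contribution to Z[i][j].
        w_diag = Z[i - 1][j - 1] * math.exp(score(x[i - 1], y[j - 1]) / T) if i > 0 and j > 0 else 0.0
        w_up = Z[i - 1][j] * gap_w if i > 0 else 0.0
        w_left = Z[i][j - 1] * gap_w if j > 0 else 0.0
        move = rng.choices(("diag", "up", "left"), weights=(w_diag, w_up, w_left))[0]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))


def trivial_match(a, b):
    """Trivial cost matrix: match = +1, mismatch = 0."""
    return 1.0 if a == b else 0.0


rng = random.Random(0)
for _ in range(3):
    print(sample_alignment("ACCG", "ACCCA", trivial_match, -2.0, 1.0, rng))
```

Lower values of T concentrate the samples on near-optimal alignments, while higher values spread them over more suboptimal ones.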

Probalign
1. For each pair of sequences (x,y) in the input set
–a. Compute partition function matrices Z(T)
–b. Estimate the posterior probability matrix $P(x_i \sim y_j)$ for (x,y) from the forward and backward partition function matrices
2. Perform the probabilistic consistency transformation and compute a maximal expected accuracy multiple alignment: align sequence profiles along a guide-tree and follow by iterative refinement (Do et al.).

Experimental results
http://bioinformatics.oxfordjournals.org/content/26/16/1958