Expected accuracy sequence alignment


1 Expected accuracy sequence alignment
Usman Roshan

2 Optimal pairwise alignment
Sum of pairs (SP) optimization: find the alignment of two sequences that maximizes the similarity score given an arbitrary cost matrix. We can find the optimal alignment in O(mn) time and space using the Needleman-Wunsch algorithm.
Recursion: M(i,j) = \max \{ M(i-1,j-1) + s(x_i,y_j), \; M(i-1,j) + g, \; M(i,j-1) + g \}
where M(i,j) is the score of the optimal alignment of x1..i and y1..j, s(xi,yj) is the substitution score of xi and yj, and g is the gap penalty.
Traceback: start at M(m,n) and repeatedly step back to the cell that attained the maximum, until M(0,0) is reached.
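The recursion and traceback can be sketched in Python as follows. This is a minimal illustration, not the slide author's code: the function name `needleman_wunsch` and the default scores (match=+1, mismatch=0, linear gap penalty g=-2) are assumptions chosen for the example.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, g=-2):
    """Global alignment with a linear gap penalty g (illustrative defaults)."""
    m, n = len(x), len(y)

    def score(a, b):
        return match if a == b else mismatch

    # M[i][j] = score of the optimal alignment of x[:i] and y[:j]
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * g
    for j in range(1, n + 1):
        M[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          M[i - 1][j] + g,
                          M[i][j - 1] + g)

    # Traceback: step back to a cell that attained the maximum.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or M[i][j] == M[i - 1][j] + g):
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return M[m][n], "".join(reversed(ax)), "".join(reversed(ay))
```

Calling `needleman_wunsch("ACGT", "AGT")` returns the optimal score together with the two gapped rows of one optimal alignment. Integer scores are used so that the equality checks in the traceback are exact.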

3 Affine gap penalties
The affine gap model allows for long insertions in distant proteins by charging a lower penalty for extension gaps. We define g as the gap open penalty (first gap) and e as the gap extension penalty (additional gaps).
Trivial cost matrix: match=+1, mismatch=0, gapopen=-2, gapextension=-0.1
Alignment: ACACCCT ACACCCC ACCT T AC CTT Score = Score = 0.9

4 Affine penalty recursion
M(i,j) denotes the best score over alignments of x1..i and y1..j ending with a match/mismatch. E(i,j) denotes the best score over alignments of x1..i and y1..j such that yj is paired with a gap. F(i,j) is defined similarly, with xi paired with a gap. The recursion takes O(mn) time, where m and n are the lengths of x and y respectively.
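A three-matrix affine recursion of this kind can be sketched as below. The exact recursion shown on the slide is not preserved in the transcript, so this follows the common textbook form (a gap of length k costs g + (k-1)e, and no direct transition between E and F); the function name `affine_score` and the default scores are assumptions for illustration.

```python
NEG_INF = float("-inf")

def affine_score(x, y, match=1.0, mismatch=0.0, g=-2.0, e=-0.1):
    """Optimal global alignment score under affine gap penalties.

    g is the gap open penalty (first gap position), e the extension penalty.
    Assumed form: E and F are entered only from M or by extension.
    """
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with x_i ~ y_j
    E = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # y_j paired with a gap
    F = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # x_i paired with a gap
    M[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                M[i][j] = max(M[i - 1][j - 1], E[i - 1][j - 1], F[i - 1][j - 1]) + s
            if j > 0:
                E[i][j] = max(M[i][j - 1] + g, E[i][j - 1] + e)
            if i > 0:
                F[i][j] = max(M[i - 1][j] + g, F[i - 1][j] + e)
    return max(M[m][n], E[m][n], F[m][n])
```

With the trivial cost matrix from the affine gap slide (match=+1, mismatch=0, gap open=-2, gap extension=-0.1), aligning ACACCCC with ACCTT gives 0.9, since at most three residues can match and the length difference forces one gap of length two.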

5 Expected accuracy alignment
The dynamic programming formulation allows us to find the optimal alignment defined by a scoring matrix and gap penalties. This may not necessarily be the most “accurate” or biologically informative. We now look at a different formulation of alignment that allows us to compute the most accurate one instead of the optimal one.

6 Posterior probability of xi aligned to yj
Let A be the set of all alignments of sequences x and y, and define P(a|x,y) to be the probability that alignment a (of x and y) is the true alignment a*. We define the posterior probability of the ith residue of x (xi) aligning to the jth residue of y (yj) in the true alignment (a*) of x and y as
P(x_i \sim y_j \mid x, y) = \sum_{a \in A} P(a \mid x, y) \, \mathbf{1}\{x_i \sim y_j \in a\}
Do et al., Genome Research, 2005

7 Expected accuracy of alignment
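We can define the expected accuracy of an alignment a as
E[\mathrm{accuracy}(a, a^*)] = \frac{1}{\min(|x|,|y|)} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \mid x, y)
The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm, using the posterior probabilities P(xi ~ yj | x, y) as match scores and no gap penalty.
Do et al., Genome Research, 2005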
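As a rough sketch of that dynamic program, the function below takes a posterior matrix and returns the maximum expected accuracy together with the aligned residue pairs. The name `mea_alignment`, the 0-based pair indices, and the normalization by min(|x|,|y|) are assumptions made for this example.

```python
def mea_alignment(P):
    """Maximum expected accuracy alignment from a posterior matrix.

    P[i][j] is the posterior probability that residue i of x aligns to
    residue j of y (0-based). Gaps carry no score in this formulation.
    """
    m = len(P)
    n = len(P[0]) if m else 0
    if m == 0 or n == 0:
        return 0.0, []
    A = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = max(A[i - 1][j - 1] + P[i - 1][j - 1],
                          A[i - 1][j],
                          A[i][j - 1])
    # Traceback to recover the aligned pairs.
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j - 1] + P[i - 1][j - 1]:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif A[i][j] == A[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return A[m][n] / min(m, n), pairs
```

For a toy posterior matrix such as `[[0.9, 0.1], [0.2, 0.8]]`, the diagonal pairing is selected because its summed posterior exceeds that of any other monotone pairing.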

8 Example for expected accuracy
True alignment
AC_CG
ACCCA
Expected accuracy=( )/4=1
Estimated alignment
ACC_G
Expected accuracy=( ) ~ 0.75

9 Estimating posterior probabilities
If correct posterior probabilities can be computed, then we can compute the correct alignment. It remains to estimate these probabilities from the data.
PROBCONS (Do et al., Genome Research 2005): estimate probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et al. 1998)
Probalign (Roshan and Livesay, Bioinformatics 2006): estimate probabilities using partition function dynamic programming matrices

10 Partition function posterior probabilities
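Standard alignment score: S(a) = \sum_{x_i \sim y_j \in a} s(x_i, y_j) + \text{gap penalties}
Probability of alignment (Miyazawa, Prot. Eng. 1995): P(a) \propto e^{S(a)/T}, where T is the temperature parameter
If we knew the alignment partition function Z(T) then P(a) = e^{S(a)/T} / Z(T)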

11 Partition function posterior probabilities
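Alignment partition function (Miyazawa, Prot. Eng. 1995): Z(T) = \sum_{a \in A} e^{S(a)/T}, where S(a) is the score of alignment a and T is the temperature parameter
Subsequently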

12 Partition function posterior probabilities
More generally the forward partition function matrices are calculated as
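The slide's formulas are not preserved in this transcript. One plausible form, assumed here, mirrors the affine recursion with max replaced by a sum and each score s replaced by the Boltzmann factor e^{s/T}. The sketch below uses that assumed form; the function name `forward_partition` and the default parameter values are hypothetical.

```python
import math

def forward_partition(x, y, T=1.0, match=1.0, mismatch=0.0, g=-2.0, e=-0.1):
    """Forward partition function matrices ZM, ZE, ZF (assumed affine form).

    ZM[i][j]: summed Boltzmann weight of alignments of x[:i], y[:j] ending
    with x_i ~ y_j; ZE: ending with y_j against a gap; ZF: x_i against a gap.
    """
    m, n = len(x), len(y)
    ZM = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZE = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZF = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZM[0][0] = 1.0  # the empty alignment has weight e^0 = 1
    w_open, w_ext = math.exp(g / T), math.exp(e / T)
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                prev = ZM[i - 1][j - 1] + ZE[i - 1][j - 1] + ZF[i - 1][j - 1]
                ZM[i][j] = prev * math.exp(s / T)
            if j > 0:
                ZE[i][j] = ZM[i][j - 1] * w_open + ZE[i][j - 1] * w_ext
            if i > 0:
                ZF[i][j] = ZM[i - 1][j] * w_open + ZF[i - 1][j] * w_ext
    return ZM, ZE, ZF
```

The full partition function is the sum of the three matrices at cell (m, n). Under this assumed form, alignments that place an E gap directly next to an F gap are not counted.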

13 Partition function matrices vs. standard affine recursions

14 Posterior probability calculation
If we defined Z’ as the “backward” partition function matrices then
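One way to sketch this calculation is shown below. It assumes the posterior takes the form Z(i-1,j-1) · e^{s(xi,yj)/T} · Z'(i+1,j+1) / Z, where Z and Z' are the summed forward and backward matrices, and it obtains the backward values by running the forward recursion on the reversed sequences. The recursion form, the function names `total_partition` and `posterior`, and the default parameters are assumptions, not taken from the slides.

```python
import math

def total_partition(x, y, T, match, mismatch, g, e):
    """Z[i][j] = ZM + ZE + ZF for prefixes x[:i], y[:j] (assumed affine form)."""
    m, n = len(x), len(y)
    ZM = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZE = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZF = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZM[0][0] = 1.0
    w_open, w_ext = math.exp(g / T), math.exp(e / T)
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                prev = ZM[i - 1][j - 1] + ZE[i - 1][j - 1] + ZF[i - 1][j - 1]
                ZM[i][j] = prev * math.exp(s / T)
            if j > 0:
                ZE[i][j] = ZM[i][j - 1] * w_open + ZE[i][j - 1] * w_ext
            if i > 0:
                ZF[i][j] = ZM[i - 1][j] * w_open + ZF[i - 1][j] * w_ext
    return [[ZM[i][j] + ZE[i][j] + ZF[i][j] for j in range(n + 1)]
            for i in range(m + 1)]

def posterior(x, y, T=1.0, match=1.0, mismatch=0.0, g=-2.0, e=-0.1):
    """P[i][j]: posterior probability that x[i] aligns to y[j] (0-based)."""
    m, n = len(x), len(y)
    Zf = total_partition(x, y, T, match, mismatch, g, e)
    # Backward values: Zb[a][b] covers the last a residues of x and last b of y.
    Zb = total_partition(x[::-1], y[::-1], T, match, mismatch, g, e)
    Z = Zf[m][n]
    P = [[0.0] * n for _ in range(m)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            P[i - 1][j - 1] = Zf[i - 1][j - 1] * math.exp(s / T) * Zb[m - i][n - j] / Z
    return P
```

Each row of the returned matrix sums to at most 1, because a residue of x can be aligned to at most one residue of y in any single alignment.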

15 Posterior probabilities using alignment ensembles
By generating an ensemble A(n,x,y) of n alignments of x and y we can estimate P(xi~yj) by counting the number of times xi is aligned to yj and dividing by n. Note that this means we are assigning equal weights to all alignments in the ensemble.
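The counting estimate can be illustrated with a short sketch. The representation of an alignment as a pair of gapped strings, the gap character "-", and the function names are assumptions for this example.

```python
from collections import Counter

def aligned_pairs(row_x, row_y, gap="-"):
    """Return the set of (i, j) residue index pairs aligned in one alignment."""
    pairs, i, j = set(), 0, 0
    for a, b in zip(row_x, row_y):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs

def ensemble_posterior(ensemble):
    """Estimate P(x_i ~ y_j) as the fraction of alignments containing the pair.

    Every alignment in the ensemble gets the same weight 1/n.
    """
    counts = Counter()
    for row_x, row_y in ensemble:
        counts.update(aligned_pairs(row_x, row_y))
    n = len(ensemble)
    return {pair: c / n for pair, c in counts.items()}
```

For an ensemble of three alignments of AC with AC, two of them the ungapped alignment and one shifted by a position, the pair (0, 0) is estimated at 2/3 and the pair (1, 0) at 1/3.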

16 Generating ensemble of alignments
We can use stochastic backtracking (Muckstein et al., Bioinformatics, 2002) to generate a given number of optimal and suboptimal alignments. At every step in the traceback we assign a probability to each of the three possible positions. This allows us to “sample” alignments from their partition function probability distribution. The resulting posterior probabilities turn out to be the same as those calculated using the forward and backward partition function matrices.
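A minimal sketch of stochastic backtracking over three-state partition function matrices is given below. It assumes the affine form of the forward recursion used for illustration here (no direct transition between the two gap states); the function names, default parameters, and the use of Python's `random.Random.choices` for the weighted draw are choices made for this example and not the cited authors' implementation.

```python
import math
import random

def forward_matrices(x, y, T, match, mismatch, g, e):
    """Forward partition function matrices ZM, ZE, ZF (assumed affine form)."""
    m, n = len(x), len(y)
    ZM = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZE = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZF = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZM[0][0] = 1.0
    w_open, w_ext = math.exp(g / T), math.exp(e / T)
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                prev = ZM[i - 1][j - 1] + ZE[i - 1][j - 1] + ZF[i - 1][j - 1]
                ZM[i][j] = prev * math.exp(s / T)
            if j > 0:
                ZE[i][j] = ZM[i][j - 1] * w_open + ZE[i][j - 1] * w_ext
            if i > 0:
                ZF[i][j] = ZM[i - 1][j] * w_open + ZF[i - 1][j] * w_ext
    return ZM, ZE, ZF

def sample_alignment(x, y, rng, T=1.0, match=1.0, mismatch=0.0, g=-2.0, e=-0.1):
    """Draw one alignment with probability proportional to e^{S(a)/T}."""
    ZM, ZE, ZF = forward_matrices(x, y, T, match, mismatch, g, e)
    w_open, w_ext = math.exp(g / T), math.exp(e / T)
    states = ["M", "E", "F"]
    i, j = len(x), len(y)
    ax, ay = [], []
    if i == 0 and j == 0:
        return "", ""
    state = rng.choices(states, weights=[ZM[i][j], ZE[i][j], ZF[i][j]])[0]
    while i > 0 or j > 0:
        if state == "M":      # emit x_i ~ y_j, any state may precede
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
            weights = [ZM[i][j], ZE[i][j], ZF[i][j]]
        elif state == "E":    # emit gap ~ y_j, preceded by M (open) or E (extend)
            ax.append("-"); ay.append(y[j - 1]); j -= 1
            weights = [ZM[i][j] * w_open, ZE[i][j] * w_ext, 0.0]
        else:                 # emit x_i ~ gap, preceded by M (open) or F (extend)
            ax.append(x[i - 1]); ay.append("-"); i -= 1
            weights = [ZM[i][j] * w_open, 0.0, ZF[i][j] * w_ext]
        if i > 0 or j > 0:
            state = rng.choices(states, weights=weights)[0]
    return "".join(reversed(ax)), "".join(reversed(ay))
```

Calling `sample_alignment("ACACCCT", "ACCTT", random.Random(0))` repeatedly yields an ensemble of gapped row pairs; lowering T concentrates the samples on high-scoring alignments, while raising it spreads them over suboptimal ones.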

17 Probalign
1. For each pair of sequences (x,y) in the input set
a. Compute partition function matrices Z(T)
b. Estimate posterior probability matrix P(xi ~ yj) for (x,y) from the partition function matrices
2. Perform the probabilistic consistency transformation and compute a maximal expected accuracy multiple alignment: align sequence profiles along a guide-tree and follow by iterative refinement (Do et al.).

18 Multiple protein alignment
Protein sequence alignment: hard problem for multiple distantly related proteins.
Several standard protein alignment benchmarks available: BAliBASE, HOMSTRAD, OXBENCH, PREFAB, and SABMARK.
Benchmark alignments are based on manual and computational structural alignment of proteins with known structure.

19 Measure of accuracy
Sum-of-pairs score: number of correctly aligned pairs divided by number of pairs in true alignment.
Column score: number of correctly aligned columns.
Statistical significance using Friedman rank test.
Example (Acc: 2/4=50%):
AACAGT
AA__GT

AACAGT
AAGT__
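The sum-of-pairs score for a pairwise alignment can be computed as in the sketch below. The function names and the gapped-string representation (with either "-" or "_" as the gap character) are assumptions for this example.

```python
def aligned_pairs(row_x, row_y, gaps="-_"):
    """Return the set of (i, j) residue index pairs aligned in one alignment."""
    pairs, i, j = set(), 0, 0
    for a, b in zip(row_x, row_y):
        if a not in gaps and b not in gaps:
            pairs.add((i, j))
        if a not in gaps:
            i += 1
        if b not in gaps:
            j += 1
    return pairs

def sum_of_pairs_score(true_alignment, estimated_alignment):
    """Correctly aligned pairs divided by the number of pairs in the true alignment."""
    true_pairs = aligned_pairs(*true_alignment)
    est_pairs = aligned_pairs(*estimated_alignment)
    return len(true_pairs & est_pairs) / len(true_pairs)

# The two alignments from the example on this slide share 2 of 4 aligned pairs.
print(sum_of_pairs_score(("AACAGT", "AA__GT"), ("AACAGT", "AAGT__")))
```

Both alignments in the example contain four aligned pairs, so the score is 2/4 regardless of which one is treated as the reference.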

20 Experimental design
Methods compared: Probalign, PROBCONS, MUSCLE, MAFFT
Probalign temperature parameter trained on RV11 subset of BAliBASE 3.0.
Default (optimized) parameters for remaining programs.
All experiments performed on CIPRES cluster at SDSC.

21 BAliBASE 3.0
Sum-of-pairs and column score accuracies
Data   Probalign     MAFFT         Probcons      MUSCLE
RV11   69.3 / 45.3   67.1 / 44.6   67.0 / 41.7   59.3 / 35.9
RV12   94.6 / 86.2   93.6 / 83.8   94.1 / 85.5   91.7 / 80.4
RV20   92.6 / 43.9   92.7 / 45.3   91.7 / 40.6   89.2 / 35.1
RV30   85.2 / 56.4   85.6 / 56.9   84.5 / 54.4   80.3 / 38.3
RV40   92.2 / 60.3   92.0 / 59.7   90.3 / 53.2   86.7 / 47.1
RV50   89.3 / 55.2   90.0 / 56.2   89.4 / 57.3   85.7 / 48.7
All    87.6 / 58.9   87.1 / 58.6   86.4 / 55.8   82.5 / 48.5
Friedman rank test P-values
Method RV11 RV12 RV20 RV30 RV40 RV50 All MAFFT NS < 0.005 Probcons 0.049 0.0233 MUSCLE 0.008

22 Heterogeneous length data I
BAliBASE datasets with maximum length and minimum deviation
Max length / Standard dev.   Probalign     MAFFT         Probcons      MUSCLE
500 / 100                    88.4 / 56.6   88.0 / 58.0   86.7 / 51.6   81.5 / 42.5
500 / 200                    88.5 / 54.6   87.0 / 51.9   87.2 / 48.9   81.9 / 42.4
1000 / 100                   91.4 / 58.1   90.4 / 55.7   89.7 / 51.6   84.3 / 44.1
1000 / 200                   90.7 / 55.0   89.3 / 51.4   89.2 / 48.7   83.2 / 42.5
BAliBASE datasets with long extensions
Max length / Standard dev. Probalign MAFFT Probcons RV / 100 (25) 1000 / 200 (20) 92.7 / 59.3 93.0 / 57.3 91.0 / 54.8 90.8 / 52.1 89.9 / 48.2 90.6 / 47.6

23 Heterogeneous length data II
BAliBASE 2.0 reference 6 datasets with max length and minimum deviation
Max length / Standard dev.   Probalign     MAFFT         Probcons
500 / 100 (40)               89.1 / 44.9   87.3 / 49.0   87.4 / 38.6
500 / 200 (21)               88.3 / 43.8   85.0 / 46.4   86.7 / 40.0
500 / 300 (9)                95.3 / 61.0   82.6 / 51.3   87.3 / 46.6
500 / 400 (5)                94.6 / 55.0   72.0 / 38.2   79.8 / 38.0
1000 / 100 (15)              90.2 / 43.3   82.4 / 36.9   85.4 / 27.6
1000 / 200 (12)              89.2 / 38.2   79.7 / 32.4   83.6 / 27.7
1000 / 300 (7)               94.5 / 52.8   78.3 / 42.4   83.9 / 34.6
1000 / 400 (5)

