Pairwise Sequence Alignment (cont.)


1 Pairwise Sequence Alignment (cont.)
(Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 6, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

2 4 Basic Questions in Pairwise Alignment
Model: a scoring function $s: A \to \mathbb{R}$ over the possible alignments $A = \{a_1, \dots, a_k\}$ of $X = x_1, \dots, x_n$ and $Y = y_1, \dots, y_m$.
Find the best alignment(s) a*, e.g., S(a*) = 21.
- Q1: How should we define s? (Modeling evolution)
- Q2: How should we define A? (Application-specific)
- Q3: How can we find a* quickly? (Dynamic programming)
- Q4: Is the alignment biologically meaningful, or just the best alignment of two unrelated sequences?
Q1 & Q4 are related! (Models for scores)

3 The Rest of This Lecture
Q4: How to assess the significance of an alignment score?
- Classic approach: extreme value distribution
- Bayesian approach: model comparison
Q1: How to define the scoring function?
- Define the substitution score s
- Define the gap penalty function g

4 First, Q4: Assessing Score Significance
In general, a larger s ⇒ a more significant alignment. The question is: how large should s be?
Factors to be considered:
- Sequence length: longer sequences are expected to give higher scores
- Number of sequences in the database: the score of the best alignment is expected to be higher for a larger DB
- Evolution time: longer evolution causes more mismatches, making a lower score more significant
The challenge is how to quantify all these…

5 Two Basic Approaches
The classical approach: extreme value distribution
- Assume a null (random) model for scores, M0
- P(Score > s | M0, x, y) = ?
The Bayesian approach: model comparison
- Assume two models for (x, y): random, M0; aligned, M1
- P(M1|x,y) / P(M0|x,y) = ?
$\frac{P(M_1|x,y)}{P(M_0|x,y)} = \frac{P(x,y|M_1)}{P(x,y|M_0)} \cdot \frac{P(M_1)}{P(M_0)}$
The log of the first factor is the log-odds score of the alignment; the second factor is the prior.

6 Extreme Value Distribution
EVD: The asymptotic distribution of the maximum $M_N$ of a series of N independent normal random variables is
$P(M_N \le x) \approx \exp\left(-e^{-\lambda (x - \mu)}\right)$
where λ and μ are constants and μ is the mode.
In general, the maximum of a large number of separate scores follows this distribution.
Example: the best local match score between two long sequences.
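
The claim about maxima can be checked empirically. The sketch below is illustrative only and is not part of the lecture: it draws the maximum of N standard normal variables many times, for a small and a larger N, and reports the mean and skewness of those maxima. The function names, the seed, and the sample sizes are all arbitrary choices made for this example.

```python
import random
import statistics


def sample_maxima(n_vars, n_trials, rng):
    """Return n_trials maxima, each taken over n_vars standard normal draws."""
    return [
        max(rng.gauss(0.0, 1.0) for _ in range(n_vars))
        for _ in range(n_trials)
    ]


def skewness(values):
    """Sample skewness: positive means a longer right-hand tail."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
maxima_small = sample_maxima(10, 1000, rng)    # N = 10 variables per maximum
maxima_large = sample_maxima(200, 1000, rng)   # N = 200 variables per maximum

print("mean of maxima, N=10: ", round(statistics.mean(maxima_small), 2))
print("mean of maxima, N=200:", round(statistics.mean(maxima_large), 2))
print("skewness, N=200:      ", round(skewness(maxima_large), 2))
```

The maxima shift upward as N grows and their distribution has a longer right tail, which is the qualitative shape of an extreme value distribution. This is why a best-of-many alignment score cannot be judged against a normal distribution.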

7 EVD of the Best Score in Ungapped Local Alignment
The number of unrelated local matches with score higher than S is approximately Poisson distributed, with mean
$E(S) = K m n e^{-\lambda S}$
where K and λ are parameters and m, n are the sequence lengths.
The probability that there is a match of score greater than S is
$P(x > S) = 1 - e^{-E(S)}$
K and λ can be fit using randomly generated data.
This gives a way to test statistical significance, e.g., p(x>21) = 0.01 vs. p(x>21) = 0.3.
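
As a rough sketch of how this significance test could be computed, the code below assumes the standard form for ungapped local alignment, E(S) = K·m·n·exp(−λS) for the Poisson mean and P(x > S) = 1 − exp(−E(S)) for the probability of at least one such match. The function names and the values of K and λ are hypothetical placeholders; in practice they would be fit to randomly generated data, as the slide says.

```python
import math


def expected_matches(score, m, n, K, lam):
    """Poisson mean: expected number of unrelated local matches scoring above `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, K, lam):
    """Probability of at least one unrelated match scoring above `score`."""
    # 1 - exp(-E), written with expm1 for accuracy when E is very small
    return -math.expm1(-expected_matches(score, m, n, K, lam))


# Hypothetical parameters, chosen only to make the example run.
K, LAM = 0.1, 0.3

for m, n in [(100, 100), (1000, 1000)]:
    p = p_value(21, m, n, K, LAM)
    print(f"m={m}, n={n}: P(x > 21) = {p:.3f}")
```

The same score of 21 becomes less significant as the sequences (or the database) get longer, because the number of chance matches grows with m·n.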

8 Bayesian Model Comparison
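Assumptions:
- M is a model for related sequences
- R is a model for unrelated (random) sequences
- Ungapped alignment, n = m
- The alignment of each pair is independent
$\log \frac{P(M|x,y)}{P(R|x,y)} = \log \frac{P(x,y|M)}{P(x,y|R)} + \log \frac{P(M)}{P(R)}$
The first term is the score S(x,y); the second is the prior (subjective!).
This partially addresses Q1: how to design the scoring function?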
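
A minimal sketch of this model comparison, under the slide's assumptions (ungapped, equal length, independent positions). It assumes the usual log-odds form, where each aligned pair (a, b) contributes log[p(ab|M) / (q(a)·q(b))], and the posterior is obtained by adding the prior log-odds and applying the logistic function. The two-letter alphabet, the probabilities, and the function names are made up for illustration.

```python
import math


def log_odds_score(x, y, pair_prob, background):
    """S(x, y): sum over positions of log[p(ab|M) / (q(a) * q(b))]."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment requires equal lengths")
    return sum(
        math.log(pair_prob[(a, b)] / (background[a] * background[b]))
        for a, b in zip(x, y)
    )


def posterior_related(score, prior_related):
    """P(M | x, y) from the score and a (subjective) prior P(M)."""
    prior_log_odds = math.log(prior_related / (1.0 - prior_related))
    return 1.0 / (1.0 + math.exp(-(score + prior_log_odds)))


# Toy two-letter alphabet with made-up probabilities (assumptions for this sketch).
background = {"A": 0.5, "B": 0.5}                      # q(a) under R
pair_prob = {("A", "A"): 0.4, ("B", "B"): 0.4,
             ("A", "B"): 0.1, ("B", "A"): 0.1}         # p(ab|M) under M

s = log_odds_score("AABBA", "AABAA", pair_prob, background)
print("score:", round(s, 3))
print("posterior with prior 0.5: ", round(posterior_related(s, 0.5), 3))
print("posterior with prior 0.01:", round(posterior_related(s, 0.01), 3))
```

The same score gives very different posteriors under different priors, which is why the prior is flagged as subjective. When searching a large database, the prior probability that any given pair is related should be small.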

9 Q1: How to Estimate Probabilities?
General idea: exploit sequences with known (“reliable”) alignments
- Simplest method: maximum likelihood estimator
- Improved method: consider evolution time (phylogenetic tree, to be covered later)
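
The simplest method can be sketched as counting aligned pairs and normalizing. This assumes the maximum likelihood estimate of p(ab|M) is the relative frequency of the pair (a, b) among all aligned columns of trusted, ungapped alignments. The input data and the function name are hypothetical.

```python
from collections import Counter


def ml_pair_probabilities(aligned_pairs):
    """Estimate p(ab|M) as the relative frequency of each aligned pair (a, b)."""
    counts = Counter()
    for x, y in aligned_pairs:
        if len(x) != len(y):
            raise ValueError("each alignment must be ungapped and equal length")
        counts.update(zip(x, y))  # one count per aligned column
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}


# Made-up "reliable" alignments over a two-letter alphabet.
alignments = [("AAB", "AAB"), ("AB", "BB")]
probs = ml_pair_probabilities(alignments)
for pair, p in sorted(probs.items()):
    print(pair, p)
```

Pairs that never occur in the training alignments get no estimate at all here, so a real estimator would typically add pseudocounts. That refinement is left out of this sketch.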

10 Dayhoff PAM Matrices
- Estimate p(b|a,t,M) (substitution probabilities) rather than p(ab|M)
- Use sufficiently similar sequence pairs to estimate p(b|a,t=1,M)
- Compute p(b|a,t+1,M) based on p(b|a,t,M)
- Compute the score matrix (e.g., PAM 250)
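
The extrapolation step can be sketched as repeated matrix multiplication, assuming p(b|a,t+1) = Σ_c p(c|a,t)·p(b|c,1) and a log-odds score log[p(b|a,t)/q(b)]. The two-letter alphabet, the one-step matrix, and the function names below are toy assumptions. Real PAM matrices cover 20 amino acids and use scaled, rounded scores.

```python
import math


def mat_mul(P, Q):
    """Multiply two square matrices stored as lists of rows."""
    n = len(P)
    return [
        [sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def substitution_at_time(P1, t):
    """p(b|a,t): apply the one-step matrix P1 a total of t times (t >= 1)."""
    result = P1
    for _ in range(t - 1):
        result = mat_mul(result, P1)
    return result


def score_matrix(Pt, q):
    """Log-odds scores log[p(b|a,t) / q(b)] (natural log, unscaled)."""
    n = len(Pt)
    return [[math.log(Pt[a][b] / q[b]) for b in range(n)] for a in range(n)]


# Toy one-step matrix: row a, column b holds p(b|a,t=1). Values are made up.
P1 = [[0.99, 0.01],
      [0.01, 0.99]]
q = [0.5, 0.5]  # background frequencies

P250 = substitution_at_time(P1, 250)
S250 = score_matrix(P250, q)
print("p(a|a, t=250):", round(P250[0][0], 4))
print("match score:", round(S250[0][0], 4), " mismatch score:", round(S250[0][1], 4))
```

As t grows, each row of the matrix drifts toward the background frequencies, so both match and mismatch scores shrink toward zero. This is the sense in which a large-t matrix such as PAM 250 is tuned for distantly related sequences.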

11 BLOSUM Matrices
- Limitation of PAM: short-time substitutions are dominated by trivial changes in the codon triplets
- BLOSUM tries to improve the estimation of p(ab|M,t) by re-sampling the aligned, ungapped sequence regions (e.g., based on PAM)
- Time t is now connected with a threshold of sequence similarity, leading to different variations (e.g., BLOSUM50 & BLOSUM62)

12 Estimating Gap Penalties
Again, the basic idea is to exploit known alignments.
Basic assumptions:
- The gap-open score d is linear in log(t)
- The gap-extend score e is constant
Example: γ(g) = A + B*log(t) + C*log(g), for a gap of length g at evolution time t
In practice, people choose the gap costs empirically for given substitution scores.
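
A small sketch of the two gap-score forms mentioned here. The first follows the example formula, read as a score for a gap of length g at evolution time t. The second is the common affine form, γ(g) = −d − (g − 1)·e, which is an assumption added for comparison rather than something stated on the slide. The constants A, B, C, d, and e are hypothetical.

```python
import math


def log_gap_score(g, t, A, B, C):
    """Example from the slide: A + B*log(t) + C*log(g) for gap length g, time t."""
    if g < 1 or t <= 0:
        raise ValueError("need gap length g >= 1 and time t > 0")
    return A + B * math.log(t) + C * math.log(g)


def affine_gap_score(g, d, e):
    """Affine form (assumed for comparison): -d to open, -e per extra position."""
    if g < 1:
        raise ValueError("need gap length g >= 1")
    return -d - (g - 1) * e


# Hypothetical constants, chosen only to make the example run.
A, B, C = -10.0, 1.5, -4.0
for g in (1, 2, 5, 10):
    print(g, round(log_gap_score(g, 250, A, B, C), 2), affine_gap_score(g, 12, 2))
```

With a negative C, the logarithmic form penalizes each additional gap position less than the previous one, whereas the affine form charges the same amount e for every extension.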

