1
Pairwise Sequence Alignment (II)
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Sept. 27, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Many slides are taken/adapted from
2
Review: Dynamic Programming for LCS
Edit graph representation of alignment: a path through the graph is an alignment.
Incrementally fill in the table.
Backtrack to find the best alignment.
[Figure: filled-in LCS dynamic programming table for sequences v and w]
3
The LCS Recurrence Revisited
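The formula can be rewritten by adding zero to the edges that come from an indel, since the penalty for an indel is 0:

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \quad \text{(matching score)} \\
s_{i-1,j} + 0 & \text{(insertion/deletion score)} \\
s_{i,j-1} + 0 & \text{(insertion/deletion score)}
\end{cases}
$$

How do we improve scoring?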
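The LCS recurrence can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names `lcs_table` and `lcs` and the example strings are assumptions made here.

```python
def lcs_table(v, w):
    """Fill the LCS table: s[i][j] is the LCS length of v[:i] and w[:j]."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Indel edges add 0 to the score.
            best = max(s[i - 1][j], s[i][j - 1])
            # The diagonal edge adds 1, but only when the symbols match.
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + 1)
            s[i][j] = best
    return s


def lcs(v, w):
    """Backtrack through the table to recover one longest common subsequence."""
    s = lcs_table(v, w)
    i, j = len(v), len(w)
    out = []
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
            out.append(v[i - 1])
            i -= 1
            j -= 1
        elif s[i][j] == s[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


# Illustrative strings (not taken from the slide's table).
print(len(lcs("ATCTGAT", "TGCATA")))  # 4
```

Filling the table takes time and space proportional to len(v) * len(w); backtracking adds only a linear number of steps.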
4
How do we improve the scoring of alignments?
Can we still find an alignment efficiently?
5
Outline
Improve Scoring
  Scoring Matrix
  Affine Gap Penalty
Variants of Alignment
  Global vs. Local alignment
Assessing Score Significance
6
Scoring Matrices
To generalize scoring, consider a (4+1) x (4+1) scoring matrix δ. For amino acid sequence alignment, the scoring matrix would be of size (20+1) x (20+1). The extra row and column hold the scores for comparing a symbol with the gap character “-”. This simplifies the recurrence as follows:

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
$$

The same dynamic programming algorithm would still work!
7
The Global Alignment Problem
Find the best alignment between two strings under a given scoring matrix.
Input: strings v and w and a scoring matrix δ
Output: an alignment of maximum score
Algorithm: dynamic programming

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
$$

The only question left is how to define the scoring matrix…
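A minimal Python sketch of this global alignment recurrence follows. The names `global_align` and `simple_delta` are assumptions made for illustration, and the scoring function uses integer scores (+1 match, -1 mismatch, -1 indel) chosen here only so that the backtracking comparisons are exact.

```python
def global_align(v, w, delta):
    """Return (score, aligned_v, aligned_w) for one best global alignment.

    delta(a, b) scores a column; "-" stands for the gap character.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and row: prefixes aligned entirely against gaps.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + delta(v[i - 1], "-")
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + delta("-", w[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                s[i - 1][j] + delta(v[i - 1], "-"),
                s[i][j - 1] + delta("-", w[j - 1]),
            )
    # Backtrack from (n, m) to (0, 0), re-deriving which edge was used.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]):
            top.append(v[i - 1])
            bottom.append(w[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + delta(v[i - 1], "-"):
            top.append(v[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(w[j - 1])
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def simple_delta(a, b):
    """Hypothetical scoring: +1 match, -1 mismatch, -1 indel."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


print(global_align("ACGT", "ACT", simple_delta))  # (2, 'ACGT', 'AC-T')
```

Swapping in a different `delta`, for example a lookup into a substitution matrix, changes the scoring without touching the algorithm.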
8
Measuring Similarity
Measuring the extent of similarity between two sequences:
  Based on percent sequence identity
  Based on conservation
9
Percent Sequence Identity
The extent to which two nucleotide or amino acid sequences are invariant.

A C C T G A G - A G
A C G T G - G C A G

Seven of the ten columns match (one mismatch, two indels): 70% identical.
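Percent identity can be computed directly from the two rows of an alignment. The sketch below assumes the denominator is the number of alignment columns, which is what reproduces the 70% on this slide; other conventions exist. The function name `percent_identity` is an assumption.

```python
def percent_identity(row_v, row_w):
    """Percentage of alignment columns in which both rows hold the same symbol."""
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have equal length")
    # A column counts as identical only if neither symbol is a gap.
    matches = sum(1 for a, b in zip(row_v, row_w) if a == b and a != "-")
    return 100.0 * matches / len(row_v)


print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))  # 70.0
```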
10
Simple Scoring
When mismatches are penalized by some constant -μ, indels are penalized by some other constant -σ, and matches are rewarded with +1, the resulting score is:
#matches - μ(#mismatches) - σ(#indels)
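The simple scoring formula can be sketched as a column count over an existing alignment. The function name `simple_score`, the example rows and the values μ = 1 and σ = 2 are assumptions chosen for illustration.

```python
def simple_score(row_v, row_w, mu, sigma):
    """Score an alignment as #matches - mu * #mismatches - sigma * #indels."""
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(row_v, row_w):
        if a == "-" or b == "-":
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels


# 7 matches, 1 mismatch, 2 indels: 7 - 1*1 - 2*2 = 2
print(simple_score("ACCTGAG-AG", "ACGTG-GCAG", mu=1, sigma=2))  # 2
```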
11
Making a Better Scoring Matrix
Scoring matrices are created based on biological evidence. Alignments can be thought of as two sequences that differ due to mutations. Some of these mutations have little effect on the organism’s function; therefore some penalties, δ(vi, wj), will be less harsh than others.
12
Scoring Matrix: Example
     A    R    N    K
A    5   -2   -1   -1
R    -    7   -1    3
N    -    -    7    0
K    -    -    -    6

Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.

AKRANR
KAAANK
-1 + (-1) + (-2) + 5 + 7 + 3 = 11
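The example score can be checked with a small lookup-based sketch. The substitution scores for A, R, N and K below are assumptions filled in from the standard BLOSUM50 matrix, chosen because they reproduce the total of 11 stated on the slide; the names `SCORES`, `delta` and `score_ungapped` are hypothetical.

```python
# Upper triangle of a symmetric substitution matrix for A, R, N, K
# (assumed values, taken from BLOSUM50).
SCORES = {
    ("A", "A"): 5, ("A", "R"): -2, ("A", "N"): -1, ("A", "K"): -1,
    ("R", "R"): 7, ("R", "N"): -1, ("R", "K"): 3,
    ("N", "N"): 7, ("N", "K"): 0,
    ("K", "K"): 6,
}


def delta(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def score_ungapped(v, w):
    """Sum the column scores of an ungapped alignment of equal-length strings."""
    if len(v) != len(w):
        raise ValueError("sequences must have equal length")
    return sum(delta(a, b) for a, b in zip(v, w))


print(score_ungapped("AKRANR", "KAAANK"))  # 11
```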
13
Scoring matrices
Amino acid substitution matrices
  PAM
  BLOSUM
DNA substitution matrices
  DNA: less conserved than protein sequences
  Less effective to compare coding regions at nucleotide level
  Simple scoring is often used
14
PAM
Point Accepted Mutation (Dayhoff et al.)
1 PAM = PAM1 = 1% average change of all amino acid positions
After 100 PAMs of evolution, not every residue will have changed:
  some residues may have mutated several times
  some residues may have returned to their original state
  some residues may not have changed at all
15
PAMX
$\mathrm{PAM}_x = (\mathrm{PAM}_1)^x$, e.g. $\mathrm{PAM}_{250} = (\mathrm{PAM}_1)^{250}$
PAM250 is a widely used scoring matrix.
Think of PAM1 as 1-step transitions and PAM250 as 250-step transitions.
[Figure: PAM250 matrix, with rows and columns labeled by amino acid: Ala (A), Arg (R), Asn (N), Asp (D), Cys (C), Gln (Q), Glu (E), Gly (G), His (H), Ile (I), Leu (L), Lys (K), ..., Trp (W), Tyr (Y), Val (V)]
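The "x-step transitions" view can be illustrated with a toy matrix power. This is only a sketch: the two-letter alphabet and the 1% mutation probability in `TOY_PAM1` are made-up values, and the helper names are assumptions. In Dayhoff's construction the power is taken of the mutation probability matrix, and the scoring matrix is then derived from it as log-odds; that last step is not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, x):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while x > 0:
        if x & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        x >>= 1
    return result


# Toy 1-step transition matrix over a two-letter alphabet (assumed values):
# each letter has a 1% chance of changing per step.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
# After 250 steps each row is still a probability distribution, but the
# chance of seeing the original letter has dropped from 0.99 to about 0.5.
print(toy_pam250[0])
```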
16
BLOSUM Blocks Substitution Matrix
Scores are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins.
The matrix name indicates evolutionary distance: BLOSUMx was created using sequences sharing no more than x% identity.
E.g., BLOSUM62 <-> 62% identity
17
The Blosum50 Scoring Matrix
$$
\mathrm{Val}(x,y) = \log \frac{p(x,y)}{p(x)\,p(y)}
$$

p(x,y): probability of seeing x aligned with y
p(x), p(y): probability of seeing x (or y) alone
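A minimal sketch of the log-odds formula follows. The probabilities in the example are invented for illustration, and the base of the logarithm is an assumption, since the slide does not state it. Published matrices are typically also scaled and rounded to integers, which is not shown here.

```python
import math


def log_odds(p_xy, p_x, p_y, base=2.0):
    """Val(x, y) = log( p(x, y) / (p(x) * p(y)) ), in the given log base."""
    return math.log(p_xy / (p_x * p_y), base)


# Made-up frequencies: a pair seen twice as often as chance scores about +1 bit,
# a pair seen half as often as chance scores about -1 bit.
print(log_odds(0.02, 0.1, 0.1))
print(log_odds(0.005, 0.1, 0.1))
```

A positive value means the pair is aligned more often than expected by chance; a negative value means it is aligned less often.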
18
Deficiency in Scoring of Indels
A fixed penalty σ is given to every indel:
  -σ for 1 indel
  -2σ for 2 consecutive indels
  -3σ for 3 consecutive indels, etc.
This can be too severe a penalty for a long run of consecutive indels, e.g. 100 in a row.
19
Deficiency in Scoring of Indels (cont.)
In nature, indels often come as a unit, not just 1 nucleotide at a time.

ATA__GC
ATATTGC
One gap of length 2: in nature, this is more likely.

ATAG_GC
AT_GTGC
Two separate gaps of length 1.

Normal scoring would give the same score for both alignments.
20
Accounting for Gaps
Gap: a contiguous sequence of spaces in one of the rows.
The score for a gap of length x is -(ρ + σx), where ρ > 0 is the penalty for introducing a gap and σ is the penalty for each space in it.
ρ will be large relative to σ because you do not want to add too much of a penalty for extending the gap.
21
Affine Gap Penalties
Gap penalties:
  -ρ-σ when there is 1 indel
  -ρ-2σ when there are 2 indels
  -ρ-3σ when there are 3 indels, etc.
  In general: -ρ - x·σ (one gap opening plus x gap extensions)
Somewhat reduced penalties (as compared to naïve scoring) are given to runs of horizontal and vertical edges.
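The difference between the two gap models is easy to see numerically. In this sketch the function names and the parameter values (a naive penalty of 5 per indel; ρ = 4 and σ = 1 for the affine model) are assumptions chosen so that both models agree on a single indel.

```python
def linear_gap_score(x, sigma):
    """Naive scoring: every indel in a run of x costs sigma."""
    return -sigma * x


def affine_gap_score(x, rho, sigma):
    """Affine scoring: -(rho + sigma * x) for a gap of length x >= 1."""
    if x == 0:
        return 0
    return -(rho + sigma * x)


for x in (1, 2, 3, 100):
    print(x, linear_gap_score(x, sigma=5), affine_gap_score(x, rho=4, sigma=1))
# 1 -5 -5
# 2 -10 -6
# 3 -15 -7
# 100 -500 -104
```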
22
Affine Gap Penalty Recurrences
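Three scores are kept for each vertex (i,j): the best score of an alignment ending in a gap in w, $s^{\downarrow}_{i,j}$; the best score of an alignment ending in a gap in v, $s^{\rightarrow}_{i,j}$; and the overall best score, $s_{i,j}$.

$$
s^{\downarrow}_{i,j} = \max \begin{cases}
s^{\downarrow}_{i-1,j} - \sigma & \text{continue gap in } w \text{ (deletion)} \\
s_{i-1,j} - (\rho + \sigma) & \text{start gap in } w \text{ (deletion)}
\end{cases}
$$

$$
s^{\rightarrow}_{i,j} = \max \begin{cases}
s^{\rightarrow}_{i,j-1} - \sigma & \text{continue gap in } v \text{ (insertion)} \\
s_{i,j-1} - (\rho + \sigma) & \text{start gap in } v \text{ (insertion)}
\end{cases}
$$

$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) & \text{match or mismatch} \\
s^{\downarrow}_{i,j} & \text{end deletion} \\
s^{\rightarrow}_{i,j} & \text{end insertion}
\end{cases}
$$

Once again, the same dynamic programming algorithm would work!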
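The affine gap recurrences can be sketched with three tables, one per kind of score. This is a score-only illustration under stated assumptions: the names `affine_global_score`, `down`, `right` and `main`, the match/mismatch scores and the values of ρ and σ are all chosen here, not taken from the lecture.

```python
NEG_INF = float("-inf")


def affine_global_score(v, w, delta, rho, sigma):
    """Best global alignment score when a gap of length x costs rho + sigma * x."""
    n, m = len(v), len(w)
    # down: alignment ends in a gap in w; right: ends in a gap in v;
    # main: best score overall at each vertex.
    down = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    right = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    main = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    main[0][0] = 0
    for i in range(1, n + 1):
        down[i][0] = -(rho + sigma * i)
        main[i][0] = down[i][0]
    for j in range(1, m + 1):
        right[0][j] = -(rho + sigma * j)
        main[0][j] = right[0][j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Continue an open gap (-sigma) or start a new one (-(rho + sigma)).
            down[i][j] = max(down[i - 1][j] - sigma,
                             main[i - 1][j] - (rho + sigma))
            right[i][j] = max(right[i][j - 1] - sigma,
                              main[i][j - 1] - (rho + sigma))
            main[i][j] = max(main[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                             down[i][j],
                             right[i][j])
    return main[n][m]


def match_delta(a, b):
    """Hypothetical substitution score: +1 match, -1 mismatch."""
    return 1 if a == b else -1


# ATA--GC over ATATTGC: 5 matches and one gap of length 2, so 5 - (4 + 2*1) = -1.
print(affine_global_score("ATAGC", "ATATTGC", match_delta, rho=4, sigma=1))  # -1
```

With ρ = 0 the three tables collapse to ordinary linear gap scoring, and one gap of length 2 costs the same as two gaps of length 1.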
23
Local vs. Global Alignment
The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph.
24
Local vs. Global Alignment (cont’d)
Local alignment: a better way to find a conserved segment.

Global alignment:
[Figure: the two sequences below aligned end to end, with gaps and isolated matches spread over their full length]

Local alignment:
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
25
Local Alignments: Why?
Two genes in different species may be similar over short conserved regions and dissimilar over the remaining regions.
Example: Homeobox genes have a short region called the homeodomain that is highly conserved between species.
A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence.
26
The Local Alignment Problem
Goal: Find the best local alignment between two strings
Input: strings v, w and scoring matrix δ
Output: alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings
27
Local Alignment in Edit Graph
Compute a “mini” global alignment to get a local alignment.
[Figure: edit graph showing a local alignment path and a global alignment path]
28
The Problem with this Problem
The problem with this approach is its long run time, $O(n^4)$:
- There are ~$n^2$ vertices (i,j) at which a local alignment can start.
- For each such vertex, computing the alignments that start there takes $O(n^2)$ time.
Solution: Dynamic programming again!
Question: How do we recursively compute the best score of any local (as opposed to global) alignment for each cell in the edit graph?
29
The Local Alignment Recurrence
The largest value of $s_{i,j}$ over the whole edit graph is the score of the best local alignment. The recurrence is shown below:

$$
s_{i,j} = \max \begin{cases}
0 \\
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
$$

Notice that the 0 term is the only change from the original recurrence of a global alignment: it lets an alignment start at any vertex at no cost.
This is the well-known Smith-Waterman local alignment algorithm.
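A minimal Python sketch of the local alignment recurrence follows. The function names, the integer scoring (+1 match, -1 mismatch, -1 indel) and the short example strings are assumptions made for illustration.

```python
def local_align(v, w, delta):
    """Return (score, aligned_v, aligned_w) for one best local alignment."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                0,  # start a fresh alignment at this vertex
                s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                s[i - 1][j] + delta(v[i - 1], "-"),
                s[i][j - 1] + delta("-", w[j - 1]),
            )
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)
    # Backtrack from the best cell until the score drops to 0.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]):
            top.append(v[i - 1])
            bottom.append(w[j - 1])
            i -= 1
            j -= 1
        elif s[i][j] == s[i - 1][j] + delta(v[i - 1], "-"):
            top.append(v[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(w[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


def simple_delta(a, b):
    """Hypothetical scoring: +1 match, -1 mismatch, -1 indel."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


print(local_align("TTTACGTGGG", "CCCACGTAAA", simple_delta))  # (4, 'ACGT', 'ACGT')
```

The run time stays proportional to len(v) * len(w), the same as for global alignment.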
30
Assessing Score Significance
In general, the larger the score s, the more significant the alignment. The question is: how large should s be?
Factors to be considered:
  Sequence length: longer sequences are expected to give higher scores
  Number of sequences in the database: the score of the best alignment is expected to be higher for a larger DB
  Evolution time: longer evolution causes more mismatches, making a lower score more significant
The challenge is how to quantify all these…
31
Two Basic Approaches
The classical approach: extreme value distribution (EVD)
  Assume a null (random) model $M_R$ for scores
  $P(\mathrm{Score} > s \mid M_R, a(x,y)) = ?$ (a(x,y) = alignment of x, y)
The Bayesian approach: model comparison
  Assume two models for a(x,y): random, R; aligned, M
  $P(M \mid a(x,y)) / P(R \mid a(x,y)) = ?$
  This ratio combines the log-odds score of the alignment with a prior.
32
EVD of the Best Score in Ungapped Local Alignment
The number of unrelated local matches with score higher than S is approximately Poisson distributed, with mean

$$
E(S) = K m n e^{-\lambda S}
$$

where m and n are the sequence lengths and K and λ are parameters.
The probability that there is a match of score greater than S is

$$
P(x > S) = 1 - e^{-E(S)}
$$

K and λ can be fit using randomly generated data.
This gives a way to test statistical significance, e.g. p(x>21) = 0.01 vs. p(x>21) = 0.3.
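The significance test can be sketched as follows, assuming the standard Karlin-Altschul form in which the Poisson mean is K·m·n·exp(-λS) and the probability of at least one match above S is 1 - exp(-mean). The function names are assumptions, and the values of K and λ in the example are placeholders, not fitted parameters.

```python
import math


def expected_matches(S, m, n, K, lam):
    """Poisson mean: expected number of unrelated local matches scoring above S."""
    return K * m * n * math.exp(-lam * S)


def p_value(S, m, n, K, lam):
    """Probability of at least one unrelated match scoring above S."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-expected_matches(S, m, n, K, lam))


# Placeholder parameters; in practice K and lam are fit to random data.
for S in (20, 30, 40):
    print(S, p_value(S, m=100, n=100, K=0.1, lam=0.3))
```

For a fixed score the p-value grows with the product of the sequence lengths, which is why the same score is less convincing for longer sequences.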
33
Bayesian Model Comparison
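Assumptions:
  M is a model for related sequences
  R is a model for unrelated (random) sequences
  Ungapped alignment, n = m
  Alignment of each pair is independent
Under these assumptions, Bayes' rule splits the log posterior odds into a score term and a prior term:

$$
\log \frac{P(M \mid x,y)}{P(R \mid x,y)} =
\underbrace{\sum_{i=1}^{n} \log \frac{p(x_i,y_i)}{p(x_i)\,p(y_i)}}_{\text{score } S(x,y) \text{ (BLOSUM scoring)}}
+ \underbrace{\log \frac{P(M)}{P(R)}}_{\text{prior (subjective!)}}
$$

This partially addresses Q1: how to design the scoring function?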
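The model comparison can be sketched numerically, assuming the alignment score S(x,y) is a log-odds score in natural-log units and that the two models are exhaustive, so that P(R) = 1 - P(M). The function name `posterior_related` and the example numbers are assumptions.

```python
import math


def posterior_related(score, prior_related):
    """P(M | x, y) from a natural-log log-odds score and the prior P(M)."""
    log_prior_odds = math.log(prior_related / (1.0 - prior_related))
    log_posterior_odds = score + log_prior_odds
    # Convert log odds back to a probability (logistic function).
    return 1.0 / (1.0 + math.exp(-log_posterior_odds))


score = math.log(9)  # the data favor "related" by 9 to 1
print(posterior_related(score, prior_related=0.5))  # about 0.9
print(posterior_related(score, prior_related=0.1))  # about 0.5
```

The same score leads to different conclusions under different priors, which is why the prior is flagged as subjective.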
34
Pairwise Alignment Summary
Sequences: X = x1,…,xn and Y = y1,…,ym
Possible alignments of X and Y: A = {a1,…,ak}
Model: scoring function s over A
Find the best alignment(s) a* … S(a*) = 21
Q1: How should we define s? (Modeling evolution)
Q2: How should we define A? (Application-specific)
Q3: How can we find a* quickly? (Dynamic programming)
Q4: Is the alignment biologically meaningful or just the best alignment of two unrelated sequences?
Q1 & Q4 are related! (Models for scores)
35
What You Should Know
Alignment scoring methods (matrix & gap)
Global vs. local alignments
How the dynamic programming algorithm solves both local and global alignments with a number of scoring strategies
Basic idea in assessing score significance