Lecture 5: Local Sequence Alignment Algorithms


CS 5263 Bioinformatics, Lecture 5: Local Sequence Alignment Algorithms

Poll: Who has learned, and still remembers, finite state machines/automata, regular grammars, and context-free grammars?

Roadmap: review of last lecture; local sequence alignment; statistics of sequence alignment (substitution matrices, significance of alignment).

Bounded dynamic programming (review): restrict the DP for x1…xM vs. y1…yN to a band of width k around the diagonal. O(kM) time, O(kM) memory, possibly O(M+k).

Linear-space alignment (review): divide and conquer around the middle column M/2; find the row k* where the optimal path crosses it, then recurse on the two subproblems of sizes roughly M/2 by k* and M/2 by N-k*. O(M+N) memory, about 2MN time.

Graph representation of sequence alignment (review): nodes are the coordinates (i, j) of the DP matrix and edges carry the match/mismatch/gap scores. [Edge-weighted grid figure from (0,0) to (3,4) omitted.] An optimal alignment is a highest-scoring path from (0, 0) to (m, n) on the alignment graph.

Question: if I change the scoring scheme, will it change the optimal alignment? For example, from match = 1, mismatch = gap = -2 to match = 2, mismatch = gap = -1? Answer: yes.

Proof Let F1 be the score of an optimal alignment under the scoring scheme Match = m > 0 Mismatch = s < 0 Gap = d < 0 Let a1, b1, c1 be the number of matches, mismatches, and gaps in the alignment F1 = a1m + b1s + c1d

Proof (cont’) Let F2 be the score of a sub-optimal alignment under the same scoring scheme Let a2, b2, c2 be the number of matches, mismatches, and gaps in the alignment F2 = a2m + b2s + c2d Let F1 = F2 + k, where k > 0

Proof (cont’) Now we change the scoring scheme, so that Match = m + 1 Mismatch = s + 1 Gap = d + 1

Proof (cont'): The new scores for the two alignments become F1' = a1*(m+1) + b1*(s+1) + c1*(d+1) = a1m + b1s + c1d + (a1+b1+c1) = F1 + (a1+b1+c1) = F1 + L1, where L1 is the length of alignment 1. Similarly, F2' = a2*(m+1) + b2*(s+1) + c2*(d+1) = F2 + (a2+b2+c2) = F2 + L2, where L2 is the length of alignment 2.

Proof (cont'): F1' - F2' = F1 - F2 + (a1+b1+c1) - (a2+b2+c2) = k + L1 - L2. In order for F1' < F2', we need k + L1 - L2 < 0, i.e. L2 - L1 > k.

Proof (cont'): This means that if, under the original scoring scheme, F1 is greater than F2 by k, but the length of alignment 2 is at least k+1 greater than that of alignment 1, then F2' will be greater than F1' under the new scoring scheme. We only need to exhibit one example showing that such a pair of alignments exists.

Example: alignment 1 has 2 matches and 3 mismatches, so F1 = 2m + 3s; alignment 2 has 3 matches and 4 gaps, so F2 = 3m + 4d.

With m = 1, s = d = -2: F1 = 2 - 6 = -4 and F2 = 3 - 8 = -5, so F1 > F2.

With m = 2, s = d = -1: F1' = 4 - 3 = 1 and F2' = 6 - 4 = 2, so F2' > F1'.

Concretely, consider aligning AACAG with ATCGT.
Alignment 1 (2 matches, 3 mismatches):
AACAG
| |
ATCGT
F1 = 2x1 - 3x2 = -4; F1' = 2x2 - 3x1 = 1.
Alignment 2 (3 matches, 4 gaps):
AA-CAG-
 |  | |
-ATC-GT
F2 = 3x1 - 4x2 = -5; F2' = 3x2 - 4x1 = 2.
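A quick way to check the arithmetic is to score the two fixed alignments above under both schemes. This is a minimal Python sketch; the function name and the hard-coded match/mismatch/gap counts simply restate the example.

```python
# Score an alignment from its counts of matches, mismatches and gaps.
def score(n_match, n_mismatch, n_gap, m, s, d):
    return n_match * m + n_mismatch * s + n_gap * d

# Alignment 1 (AACAG / ATCGT): 2 matches, 3 mismatches, 0 gaps.
# Alignment 2 (AA-CAG- / -ATC-GT): 3 matches, 0 mismatches, 4 gaps.
for m, s, d in [(1, -2, -2), (2, -1, -1)]:
    f1 = score(2, 3, 0, m, s, d)
    f2 = score(3, 0, 4, m, s, d)
    print(f"m={m}, s={s}, d={d}: F1={f1}, F2={f2}")
# F1=-4 > F2=-5 under the first scheme, but F1=1 < F2=2 under the second.
```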

On the other hand, if we had doubled all scores, so that m' = 2m, s' = 2s, d' = 2d, then F1' = 2F1 and F2' = 2F2, and the optimal alignment would not change.

Today: how to model gaps more accurately; local sequence alignment; statistics of alignment.

What's a better alignment?
Alignment 1:
GACGCCGAACG
|||||   |||
GACGC---ACG
Alignment 2:
GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Under a linear gap penalty both have score = 8 x m - 3 x d. However, gaps usually occur in bunches: during evolution, chunks of DNA may be lost entirely. Another example is aligning genomic sequence against cDNA (reverse complementary to mRNA), where whole stretches of the genomic sequence are absent.

Model gaps more accurately. Current model: a gap of length n incurs penalty n*d. More general: a convex gap penalty function γ(n), e.g. γ(n) = c * sqrt(n), so that each additional gapped position costs less than the previous one.

General gap dynamic programming. Initialization: same as before. Iteration:
F(i, j) = max of:
  F(i-1, j-1) + s(xi, yj)
  max over k = 0…i-1 of F(k, j) - γ(i-k)
  max over k = 0…j-1 of F(i, k) - γ(j-k)
Termination: same. Running time: O(N^2 M) (cubic). Space: O(NM) (the linear-space algorithm is not applicable).
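The cubic-time recurrence above translates almost directly into code. Below is a minimal Python sketch; the example sequences, the substitution function, and the constant in γ(n) = c*sqrt(n) are illustrative assumptions, not values from the lecture.

```python
import math

def general_gap_align(x, y, sub, gamma):
    """Global alignment with an arbitrary gap penalty function gamma(length).

    sub(a, b) -> substitution score for aligning characters a and b.
    gamma(n)  -> penalty (a positive number) for a gap of length n.
    Cubic running time, matching the slide's O(N^2 M).
    """
    M, N = len(x), len(y)
    NEG = float("-inf")
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0.0
    # First row/column: one long gap of the appropriate length.
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            best = F[i - 1][j - 1] + sub(x[i - 1], y[j - 1])
            # A gap ending at row i (y_j unmatched region), every possible start k.
            for k in range(i):
                best = max(best, F[k][j] - gamma(i - k))
            # A gap ending at column j.
            for k in range(j):
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[M][N]

# Example with a convex gap penalty gamma(n) = 2 * sqrt(n):
score = general_gap_align("GACGCCGAACG", "GACGCACG",
                          lambda a, b: 2 if a == b else -1,
                          lambda n: 2 * math.sqrt(n))
print(score)
```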

Compromise: affine gaps. γ(n) = d + (n - 1)e, where d is the gap open penalty and e is the gap extension penalty. Example with match = 2, gap open = 5, gap extension = 1:
GACGCCGAACG
|||||   |||
GACGC---ACG    score = 8x2 - 5 - 2x1 = 9
GACGCCGAACG
|||| | | ||
GACG-C-A-CG    score = 8x2 - 3x5 = 1

Additional states: the amount of state needed increases. In scoring a single entry of our matrix, we need to remember an extra piece of information: are we continuing a gap in x? (if not, starting one is more expensive); are we continuing a gap in y? (if not, starting one is more expensive); or are we continuing from a match between xi and yj?

Finite state automaton view: three states, "xi and yj aligned", "xi aligned to a gap", and "yj aligned to a gap". Entering a gap state from the aligned state costs d (gap open), staying in a gap state costs e (gap extension), and the aligned state adds σ(xi, yj). [State diagram omitted]

Dynamic programming: we encode this information in three different matrices. For each cell (i, j) we use three variables:
F(i, j): best alignment of x1..xi and y1..yj given that xi aligns to yj
Ix(i, j): best alignment of x1..xi and y1..yj given that yj aligns to a gap (a gap in x)
Iy(i, j): best alignment of x1..xi and y1..yj given that xi aligns to a gap (a gap in y)

F(i, j) = (xi, yj) + max Ix(i – 1, j – 1) Iy(i – 1, j – 1) d F(i – 1, j – 1) F(i, j) = (xi, yj) + max Ix(i – 1, j – 1) Iy(i – 1, j – 1) F(i, j – 1) – d Ix(i, j) = max Iy(i, j – 1) – d Ix(i, j – 1) – e F(i – 1, j) – d Iy(i, j) = max Ix(i – 1, j) – d Iy(i – 1, j) – e Continuing alignment Closing gaps in x Closing gaps in y Opening a gap in x Gap extension in x Opening a gap in y Gap extension in y

[Figures omitted: the F, Ix, and Iy matrices and the dependencies among their cells.]

If we stack all three matrices, there is no cyclic dependency, so we can fill in all three matrices in order.

Algorithm:
for i = 1:M
  for j = 1:N
    fill in F(i, j), Ix(i, j), Iy(i, j)
  end
end
Final score = max(F(M, N), Ix(M, N), Iy(M, N))
Time: O(MN). Space: O(MN), or O(N) when combined with the linear-space algorithm.
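For concreteness, here is a minimal Python sketch of the three-matrix affine-gap recurrence described above (score only, no traceback). The example sequences and the mismatch score of -1 are assumptions for illustration; the match, gap-open, and gap-extension values are taken from the earlier affine-gap slide.

```python
def affine_global_align(x, y, sigma, d, e):
    """Global alignment with affine gap penalty gamma(n) = d + (n - 1) * e.

    sigma(a, b) -> substitution score; d = gap-open and e = gap-extension
    penalties (given as positive numbers). Returns the optimal score.
    """
    M, N = len(x), len(y)
    NEG = float("-inf")
    F  = [[NEG] * (N + 1) for _ in range(M + 1)]  # x_i aligned to y_j
    Ix = [[NEG] * (N + 1) for _ in range(M + 1)]  # y_j aligned to a gap (gap in x)
    Iy = [[NEG] * (N + 1) for _ in range(M + 1)]  # x_i aligned to a gap (gap in y)
    F[0][0] = 0.0
    for j in range(1, N + 1):           # leading gap in x
        Ix[0][j] = -d - (j - 1) * e
    for i in range(1, M + 1):           # leading gap in y
        Iy[i][0] = -d - (i - 1) * e
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            F[i][j] = sigma(x[i - 1], y[j - 1]) + max(
                F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] - d,    # open a gap in x
                           Iy[i][j - 1] - d,   # gap in x right after a gap in y
                           Ix[i][j - 1] - e)   # extend a gap in x
            Iy[i][j] = max(F[i - 1][j] - d,
                           Ix[i - 1][j] - d,
                           Iy[i - 1][j] - e)
    return max(F[M][N], Ix[M][N], Iy[M][N])

# The affine-gap example from the slides (match 2, gap open 5, extension 1):
sigma = lambda a, b: 2 if a == b else -1   # mismatch score is an assumption
print(affine_global_align("GACGCCGAACG", "GACGCACG", sigma, d=5, e=1))  # expect 9
```

Running it on the GACGCCGAACG / GACGCACG pair reproduces the score of 9 computed on the earlier slide.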

To simplify, two matrices suffice:
F(i, j) = max of:
  F(i - 1, j - 1) + σ(xi, yj)
  I(i - 1, j - 1) + σ(xi, yj)
I(i, j) = max of:
  F(i, j - 1) - d
  I(i, j - 1) - e
  F(i - 1, j) - d
  I(i - 1, j) - e
Here I(i, j) is the best alignment between x1…xi and y1…yj given that either xi or yj is aligned to a gap. This simplification is possible because no alternating gaps are allowed.

To summarize global alignment: the basic algorithm is Needleman-Wunsch. Variants: overlap detection and longest common subsequence (achieved by varying the initial conditions or the scoring), bounded DP (pruning the search space), linear space (divide and conquer), and affine gap penalties.

Local alignment

The local alignment problem: given two strings X = x1……xM and Y = y1……yN, find substrings x', y' whose similarity (optimal global alignment value) is maximum. E.g. X = abcxdex, Y = xxxcde: x' = cxde, y' = c-de.

Why local alignment? Conserved regions may be a small part of the whole sequence, e.g. the "active site" of a protein, or genes and exons scattered among "junk" DNA; we may also not have the whole sequence. Global alignment might miss such regions if the flanking "junk" outweighs the similar regions.

Genes are shuffled between genomes. [Figure omitted: the same genes A, B, C, D appear in different orders and arrangements in two genomes.]

Naïve algorithm: for all substrings X' of X and Y' of Y, align X' and Y' via dynamic programming and retain the pair with the maximum value; output the retained pair. Time: O(n^2) choices for X', O(m^2) choices for Y', O(nm) for each DP, so O(n^3 m^3) total.

Reminder: in the overlap detection algorithm we give no penalty to gaps at the ends (free end gaps).

Similarly here: we want to be free of penalty for the unaligned regions.

The big idea Whenever we get to some bad region (negative score), we ignore the previous alignment Reset score to zero

The Smith-Waterman algorithm. Initialization: F(0, j) = F(i, 0) = 0. Iteration:
F(i, j) = max of:
  0
  F(i - 1, j) - d
  F(i, j - 1) - d
  F(i - 1, j - 1) + σ(xi, yj)

The Smith-Waterman algorithm. Termination: if we want the best local alignment, take FOPT = max over i, j of F(i, j) and trace back from that cell. If we want all local alignments scoring > t, find all i, j with F(i, j) > t and trace back from each.
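The recurrence and traceback fit in a short Python sketch. This is a minimal illustration rather than production code; the default scores match the worked example on the following slides, and ties in the traceback are broken arbitrarily, so which of several co-optimal alignments is returned may vary.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    """Local alignment (Smith-Waterman) with a linear gap penalty.

    Returns the best score and one optimal local alignment (as two strings).
    """
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]       # first row/column stay 0
    best, best_ij = 0, (0, 0)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                        # start a new local alignment
                          F[i - 1][j - 1] + s,      # align x_i with y_j
                          F[i - 1][j] + gap,        # x_i aligned to a gap
                          F[i][j - 1] + gap)        # y_j aligned to a gap
            if F[i][j] > best:
                best, best_ij = F[i][j], (i, j)
    # Trace back from the best cell until we reach a 0.
    ax, ay = [], []
    i, j = best_ij
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append('-'); i -= 1
        else:
            ax.append('-'); ay.append(y[j - 1]); j -= 1
    return best, ''.join(reversed(ax)), ''.join(reversed(ay))

# The example from the slides: local alignment of abcxdex and xxxcde.
print(smith_waterman("abcxdex", "xxxcde"))   # score 5, e.g. ('cxde', 'c-de')
```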

The correctness of the algorithm can be proved by induction using the alignment graph.

Worked example: filling the Smith-Waterman matrix for the two sequences from the example above, with match = 2, mismatch = -1, gap = -1. [Step-by-step matrix-filling figures omitted.] The highest cell score is 5; tracing back from it yields two equally good local alignments:
cxde        x-de
| ||        | ||
c-de        xcde

Remarks: there are no negative values in the local alignment DP array, and an optimal local alignment will never have a gap on either end. Local alignment: "Smith-Waterman". Global alignment: "Needleman-Wunsch".

Analysis. Time: O(MN) for finding the best alignment; more depending on the number of sub-optimal alignments reported. Memory: O(MN), or O(M+N) possible.

The statistics of alignment: where does the substitution score σ(xi, yj) come from? And are two aligned sequences actually related?

Probabilistic model of alignments. We'll focus on protein alignments without gaps. Given an alignment, we can consider two possibilities: R, the sequences are related by evolution; U, the sequences are unrelated. How can we distinguish these possibilities? And how is this view related to the amino-acid substitution matrix?

Model for unrelated sequences: assume each position of the alignment is independently sampled from some background distribution of amino acids, where ps is the probability of amino acid s in the sequences. The probability of seeing an amino acid s aligned to an amino acid t by chance is Pr(s, t | U) = ps * pt. The probability of seeing an ungapped alignment between x = x1…xn and y = y1…yn by chance is Pr(x, y | U) = ∏i (pxi * pyi).

Model for related sequences: assume each pair of aligned amino acids evolved from a common ancestor. Let qst be the probability that amino acid s in one sequence is related to t in the other sequence. The probability of an alignment of x and y is then given by Pr(x, y | R) = ∏i qxiyi.

Probabilistic model of alignments: how can we decide which possibility (U or R) is more likely? One principled way is to consider the relative likelihood of the two possibilities, the odds ratio: Pr(x, y | R) / Pr(x, y | U) = ∏i qxiyi / (pxi * pyi). A higher ratio means that R is more likely than U.

Log odds ratio: taking the log, we get log [Pr(x, y | R) / Pr(x, y | U)] = Σi log [qxiyi / (pxi * pyi)]. Recall that the score of an (ungapped) alignment is given by S = Σi σ(xi, yi).

Therefore, if we define σ(s, t) = log [qst / (ps * pt)], we are actually defining the alignment score as the log odds ratio (log likelihood ratio) between the two models R and U. This is indeed how biologists have defined the substitution matrices for proteins.

ps can be estimated by counting amino acids in the available protein sequences. But how do we get qst, the probability that s and t derive from a common ancestor? It is counted from trusted alignments of related sequences.
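As a toy illustration of this idea (and only of the idea: the real PAM/BLOSUM procedures involve sequence clustering and an evolutionary model), one could estimate ps from raw sequences and qst from trusted aligned residue pairs, then take the log-odds ratio. The function name and inputs below are assumptions for illustration.

```python
import math
from collections import Counter

def log_odds_scores(aligned_pairs, sequences):
    """Build a log-odds score sigma(s, t) = log(q_st / (p_s * p_t)).

    aligned_pairs: list of (s, t) residue pairs from trusted alignments.
    sequences:     list of sequences used to estimate background frequencies p_s.
    """
    # Background frequencies p_s.
    bg = Counter(c for seq in sequences for c in seq)
    total_bg = sum(bg.values())
    p = {s: n / total_bg for s, n in bg.items()}
    # Pair frequencies q_st, counted symmetrically.
    pairs = Counter()
    for s, t in aligned_pairs:
        pairs[(s, t)] += 1
        pairs[(t, s)] += 1     # keep the matrix symmetric
    total_pairs = sum(pairs.values())
    def sigma(s, t):
        q = pairs[(s, t)] / total_pairs
        return math.log(q / (p[s] * p[t])) if q > 0 else float("-inf")
    return sigma
```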

Protein substitution matrices: two popular sets of matrices for protein sequences. PAM matrices [Dayhoff et al., 1978]: better for aligning closely related sequences. BLOSUM matrices [Henikoff & Henikoff, 1992]: for both closely and remotely related sequences.

Properties visible in such a matrix: substitutions between chemically similar amino acids score positive; common amino acids get low weights; rare amino acids get high weights. [Matrix figure omitted.]

BLOSUM-N matrices are constructed from a database called BLOCKS, which contains many closely related sequences, so conserved amino acids may be over-counted. For N = 62, the probabilities qst were computed using trusted alignments with no more than 62% identity (identity = % of matched columns). Using this matrix, the Smith-Waterman algorithm is most effective at detecting real alignments with a similar identity level (i.e. ~62%).

If you want to detect homologous genes with high identity, you may want a BLOSUM matrix with a higher N, say BLOSUM75. On the other hand, if you want to detect remote homology, you may want to use a lower N, say BLOSUM50. BLOSUM62 is the standard.

For DNA there is no database of trusted alignments to start with. Instead, specify the percentage identity you would like to detect; you can then derive the substitution matrix by a short calculation.

For example, suppose pA = pC = pT = pG = 0.25 and we want to detect 88% identity. Then qAA = qCC = qTT = qGG = 0.22 and the rest = 0.12/12 = 0.01. Using natural logarithms: σ(A, A) = σ(C, C) = σ(G, G) = σ(T, T) = log(0.22 / (0.25*0.25)) = 1.26, and σ(s, t) = log(0.01 / (0.25*0.25)) = -1.83 for s ≠ t.
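A small Python sketch of this calculation, assuming uniform base frequencies, equally likely mismatches, and natural logarithms (the function name is illustrative):

```python
import math

def dna_substitution_scores(identity, p=0.25):
    """Derive DNA match/mismatch scores from a target identity, as on the slide."""
    q_match = identity / 4           # e.g. 0.88 identity -> q_AA = ... = 0.22
    q_mismatch = (1 - identity) / 12
    match_score = math.log(q_match / (p * p))        #  1.26 for 88% identity
    mismatch_score = math.log(q_mismatch / (p * p))  # -1.83 for 88% identity
    return match_score, mismatch_score

m, s = dna_substitution_scores(0.88)
print(round(m, 2), round(s, 2))      # 1.26 -1.83
print(round(4 * m), round(4 * s))    # scaling by 4 and rounding gives 5 -7
```

The last line also shows the scaling trick from the slide after next: multiplying by 4 and rounding gives the integer scores 5 and -7.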

Substitution matrix:
       A      C      G      T
A    1.26  -1.83  -1.83  -1.83
C   -1.83   1.26  -1.83  -1.83
G   -1.83  -1.83   1.26  -1.83
T   -1.83  -1.83  -1.83   1.26

Scaling won't change the alignment: multiply by 4 and round off to get integer scores.
       A    C    G    T
A      5   -7   -7   -7
C     -7    5   -7   -7
G     -7   -7    5   -7
T     -7   -7   -7    5

Arbitrary substitution matrix Say you have a substitution matrix provided by someone It’s important to know what you are actually looking for when you use the matrix

What's the difference between these two matrices? Which one should I use?
Matrix 1: match = 1, mismatch = -2 (same scores for all of A, C, G, T).
Matrix 2: match = 5, mismatch = -4.

We had σ(s, t) = log [qst / (ps * pt)]. Scale it by a factor λ, so that λ σ(s, t) = log [qst / (ps * pt)]. Reorganize: qst = ps * pt * e^(λ σ(s, t)).

Since all probabilities must sum to 1, we have Σs,t ps * pt * e^(λ σ(s, t)) = 1. Suppose again ps = 0.25 for any s. We know σ(s, t) from the substitution matrix, so we can solve this equation for λ, then plug λ into qst = ps * pt * e^(λ σ(s, t)) to get qst.
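A minimal numerical sketch of this procedure, using bisection to find the positive root of the equation above (uniform ps = 0.25 assumed; the function name is illustrative):

```python
import math

def solve_lambda(sigma, p=0.25, lo=1e-6, hi=10.0, tol=1e-9):
    """Solve sum_{s,t} p_s * p_t * exp(lambda * sigma(s, t)) = 1 for lambda > 0.

    sigma: dict mapping (s, t) to the substitution score. A unique positive
    root exists when the expected score under the background model is negative.
    """
    bases = "ACGT"
    def f(lam):
        return sum(p * p * math.exp(lam * sigma[(s, t)])
                   for s in bases for t in bases) - 1.0
    for _ in range(200):            # simple bisection
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2

# Matrix 1 from the slide: match = 1, mismatch = -2.
sigma1 = {(s, t): (1 if s == t else -2) for s in "ACGT" for t in "ACGT"}
lam = solve_lambda(sigma1)
q_match = 0.25 * 0.25 * math.exp(lam * 1)
print(round(lam, 2), round(q_match, 2), round(4 * q_match, 2))
# ~1.33, ~0.24 on the diagonal, i.e. roughly 95% identity
```

For the match = 1 / mismatch = -2 matrix this gives λ ≈ 1.33 and qst ≈ 0.24 on the diagonal, matching the numbers on the next slide.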

Matrix 1 (match = 1, mismatch = -2): λ = 1.33, qst = 0.24 for s = t and 0.004 for s ≠ t; this translates to roughly 95% identity. Matrix 2 (match = 5, mismatch = -4): λ = 1.21, qst = 0.16 for s = t and 0.03 for s ≠ t; this translates to roughly 65% identity.