Presentation on theme: "Sequence Alignment I Lecture #2"— Presentation transcript:
1 Sequence Alignment I Lecture #2
Background Readings: Gusfield, chapter 11. Durbin et al., chapter 2.
This class has been edited from Nir Friedman’s lecture, which is available at … Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.
2 Sequence Comparison
Much of bioinformatics involves sequences:
  DNA sequences
  RNA sequences
  Protein sequences
We can think of these sequences as strings of letters:
  DNA & RNA: alphabet Σ of 4 letters (A, C, T/U, G)
  Protein: alphabet Σ of 20 letters (A, R, N, D, C, Q, …)
3 Sequence Comparison (cont)
Finding similarity between sequences is important for many biological questions. For example:
  Find similar proteins: allows us to predict function & structure
  Locate similar subsequences in DNA: allows us to identify (e.g.) regulatory elements
  Locate DNA sequences that might overlap: helps in sequence assembly
4 Sequence Alignment
Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
  GCGCATGGATTGAGCGA
  TGCGCCATTGATGACCA
A possible alignment:
  -GCGC-ATGGATTGAGCGA
  TGCGCCATTGAT-GACC-A
5 Alignments
  -GCGC-ATGGATTGAGCGA
  TGCGCCATTGAT-GACC-A
Three “components”:
  Perfect matches
  Mismatches
  Insertions & deletions (indels)
Formal definition of alignment:
6 Choosing Alignments
There are many (how many?) possible alignments. For example, compare:
  -GCGC-ATGGATTGAGCGA
  TGCGCCATTGAT-GACC-A
to
  ------GCGCATGGATTGAGCGA
  TGCGCC----ATTGATGACCA--
Which one is better?
7 Scoring Alignments
Motivation: Similar (“homologous”) sequences evolved from a common ancestor. In the course of evolution, the sequences changed from the ancestral sequence by random mutations:
  Replacement: one letter changed to another
  Deletion: deletion of a letter
  Insertion: insertion of a letter
Scoring of sequence similarity should reflect how many and which operations took place.
8 A Naive Scoring Rule
Each position is scored independently, using:
  Match: +1
  Mismatch: -1
  Indel: -2
The score of an alignment is the sum of its position scores.
9 Example
  -GCGC-ATGGATTGAGCGA
  TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3
  ------GCGCATGGATTGAGCGA
  TGCGCC----ATTGATGACCA--
Score: (+1x5) + (-1x6) + (-2x12) = -25
According to this scoring, the first alignment is better.
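The column-by-column scoring used in this example can be checked mechanically. The following is a minimal sketch, assuming the naive rule (match +1, mismatch -1, indel -2); the function name `score_alignment` and its parameters are illustrative and not part of the lecture.

```python
def score_alignment(row_s: str, row_t: str,
                    match: int = 1, mismatch: int = -1, indel: int = -2) -> int:
    """Score a given alignment (two gapped rows of equal length) column by column."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for x, y in zip(row_s, row_t):
        if x == "-" and y == "-":
            raise ValueError("a column may not contain two gaps")
        if x == "-" or y == "-":
            total += indel        # insertion or deletion
        elif x == y:
            total += match        # perfect match
        else:
            total += mismatch     # replacement
    return total


first = score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
second = score_alignment("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--")
print(first, second)
```

Counting columns by hand gives 13 matches, 2 mismatches and 4 indels for the first alignment, and 5 matches, 6 mismatches and 12 indels for the second.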
10 More General Scores
The choice of +1, -1, and -2 scores is quite arbitrary. Depending on the context, some changes are more plausible than others:
  Exchange of an amino acid by one with similar properties (size, charge, etc.)
  vs.
  Exchange of an amino acid by one with very different properties
Probabilistic interpretation: (e.g.) How likely is one alignment versus another?
11 Additive Scoring Rules
We define a scoring function σ by specifying:
  σ(x,y) is the score of replacing x by y
  σ(x,-) is the score of deleting x
  σ(-,x) is the score of inserting x
The score of an alignment is defined as the sum of its position scores.
12 The Optimal Score
The optimal (maximal) score between two sequences is the maximal score of all alignments of these sequences, namely,
  $V(s,t) = \max_{A \text{ an alignment of } s \text{ and } t} \mathrm{score}(A)$
Computing the maximal score and actually finding an alignment that yields the maximal score are closely related tasks with similar algorithms. We now address these problems.
13 Computing Optimal Score
How can we compute the optimal score?
If |s| = n and |t| = m, the number A(m,n) of possible alignments is large!
Exercise: Show that … (note in the source: the formula is probably not accurate)
So it is not a good idea to go over all alignments.
The additive form of the score allows us to apply dynamic programming to compute the optimal score efficiently.
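To get a feeling for how quickly the number of alignments grows, here is a minimal sketch that counts them. It assumes an alignment is a sequence of columns of the form (x,y), (x,-) or (-,y), and that two alignments are different whenever their columns differ; the exact formula of the exercise is not preserved in this transcript, so this is not necessarily the A(m,n) intended there. The function name `count_alignments` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Number of column sequences aligning a length-m string with a length-n string."""
    if m == 0 or n == 0:
        return 1  # only one way: pair every remaining letter with a gap
    # Classify by the last column: (x,-), (-,y) or (x,y).
    return (count_alignments(m - 1, n)
            + count_alignments(m, n - 1)
            + count_alignments(m - 1, n - 1))


for k in (1, 2, 3, 10, 17):
    print(k, count_alignments(k, k))
```

Under this counting convention there are 3 alignments for two single letters, 13 for two strings of length 2, and more than a billion for two strings of length 17 such as the ones in the running example.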
14 Recursive Argument
Suppose we have two sequences: s[1..n+1] and t[1..m+1]. The best alignment must be one of three cases:
  1. Last match is (s[n+1], t[m+1])
  2. Last match is (s[n+1], -)
  3. Last match is (-, t[m+1])
17 Useful Notation
(ab)use of notation: V[i,j] = value of the optimal alignment between the length-i prefix of s and the length-j prefix of t, that is, V[i,j] = V(s[1..i], t[1..j]).
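18 Recursive Argument
(ab)use of notation: V[i,j] = value of the optimal alignment between the length-i prefix of s and the length-j prefix of t.
Using our recursive argument, we get the following recurrence for V, where σ is the scoring function:
  $V[i+1,j+1] = \max \{\, V[i,j] + \sigma(s[i+1],t[j+1]),\ V[i,j+1] + \sigma(s[i+1],-),\ V[i+1,j] + \sigma(-,t[j+1]) \,\}$
[Figure: the cell V[i+1,j+1] and its three neighbours V[i,j], V[i+1,j] and V[i,j+1].]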
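19 Recursive Argument
Of course, we also need to handle the base cases in the recursion (boundary of matrix), where a prefix is aligned against gaps only (e.g. AA versus - -):
  $V[0,0] = 0$
  $V[i+1,0] = V[i,0] + \sigma(s[i+1],-)$
  $V[0,j+1] = V[0,j] + \sigma(-,t[j+1])$
We fill the “interior of matrix” using our recurrence rule.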
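The base cases and the recurrence together can be sketched as a short table-filling routine. This is an illustration only: it assumes the naive scoring rule (match +1, mismatch -1, indel -2), and the names `sigma` and `global_score_matrix` as well as the example strings AAAC and AGC are chosen here for demonstration. Python strings are 0-based, so s[i] in the code plays the role of s[i+1] in the slides.

```python
def sigma(x: str, y: str) -> int:
    """Naive scoring function (an assumption): match +1, mismatch -1, indel -2."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def global_score_matrix(s: str, t: str) -> list[list[int]]:
    """Fill V, where V[i][j] is the optimal score of aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary of the matrix: a prefix aligned against gaps only.
    for i in range(n):
        V[i + 1][0] = V[i][0] + sigma(s[i], "-")
    for j in range(m):
        V[0][j + 1] = V[0][j] + sigma("-", t[j])
    # Interior of the matrix: best of the three possible last columns.
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                V[i][j] + sigma(s[i], t[j]),      # last column (s[i], t[j])
                V[i][j + 1] + sigma(s[i], "-"),   # last column (s[i], -)
                V[i + 1][j] + sigma("-", t[j]),   # last column (-, t[j])
            )
    return V


V = global_score_matrix("AAAC", "AGC")
for row in V:
    print(row)
print("optimal score:", V[-1][-1])
```

The optimal score is the bottom-right entry V[n][m]; both time and space are O(mn).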
20 Dynamic Programming Algorithm
[Figure: alignment matrix with axes S and T.]
We continue to fill the matrix using the recurrence rule.
21 Dynamic Programming Algorithm
[Figure: alignment matrix with axes S and T, showing the cells V[0,0], V[0,1], V[1,0] and V[1,1], with the annotations +1, -2 and -2 (A- versus -A).]
22 Dynamic Programming Algorithm
[Figure: alignment matrix with axes S and T.]
(Hey, what is the scoring function σ(x,y)?)
28 Space Complexity
In real-life applications, n and m can be very large. The space requirements of O(mn) can be too demanding:
  If m = n = 1000, we need 1MB of space
  If m = n = 10000, we need 100MB of space
We can afford to perform extra computation to save space: looping over a million operations takes less than a second on modern workstations.
Can we trade space for time?
29 Why Do We Need So Much Space?
To compute just the value V[n,m] = V(s[1..n], t[1..m]), we need only O(min(n,m)) space: compute V[i,j] column by column, storing only two columns in memory (or line by line if lines are shorter).
[Figure: small example matrix with the letters A, G, C and the indices 1-4 on its axes, filled column by column.]
Note however that this “trick” fails if we want to reconstruct the optimal alignment: trace-back information requires keeping all back pointers, O(mn) memory.
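A minimal sketch of the two-line trick follows. It assumes the naive scoring rule, which is symmetric, so the two sequences may be swapped to make the stored line as short as possible; with an asymmetric scoring function the swap would not be valid. The names `sigma` and `global_score_linear_space` are illustrative.

```python
def sigma(x: str, y: str) -> int:
    """Naive scoring function (an assumption): match +1, mismatch -1, indel -2."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def global_score_linear_space(s: str, t: str) -> int:
    """Optimal global score, keeping only two lines of the matrix in memory."""
    if len(t) > len(s):
        s, t = t, s  # valid only because sigma is symmetric here
    # prev holds the line for the empty prefix of s: t[:j] against gaps only.
    prev = [0] * (len(t) + 1)
    for j, y in enumerate(t):
        prev[j + 1] = prev[j] + sigma("-", y)
    for x in s:
        curr = [prev[0] + sigma(x, "-")]      # boundary cell of the new line
        for j, y in enumerate(t):
            curr.append(max(
                prev[j] + sigma(x, y),        # diagonal neighbour
                prev[j + 1] + sigma(x, "-"),  # neighbour in the previous line
                curr[j] + sigma("-", y),      # neighbour in the current line
            ))
        prev = curr                           # the older line is discarded
    return prev[-1]


print(global_score_linear_space("AAAC", "AGC"))
```

Only the score is returned: the discarded lines are exactly the information a trace-back would need.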
30 Local Alignment
The alignment version we studied so far is called global alignment: we align the whole sequence s to the whole sequence t.
Global alignment is appropriate when s and t are highly similar (examples?), but makes little sense if they are highly dissimilar, for example when s (“the query”) is very short but t (“the database”) is very long.
31 Local Alignment
When s and t are not necessarily similar, we may want to consider a different question:
  Find similar (contiguous) subsequences of s and t.
  Formally, given s[1..n] and t[1..m], find i, j, k, and l such that V(s[i..j], t[k..l]) is maximal.
This version is called local alignment.
32 Local Alignment
As before, we use dynamic programming.
We now want to set V[i,j] to record the maximum value over all alignments of a suffix of s[1..i] and a suffix of t[1..j]. In other words, we look for a suffix of a prefix.
How should we change the recurrence rule? Same as before, but with an option to start afresh.
The result is called the Smith-Waterman algorithm, after its inventors (1981).
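33 Local Alignment
New option: we can start a new match instead of extending a previous alignment:
  $V[i+1,j+1] = \max \{\, 0,\ V[i,j] + \sigma(s[i+1],t[j+1]),\ V[i,j+1] + \sigma(s[i+1],-),\ V[i+1,j] + \sigma(-,t[j+1]) \,\}$
The option 0 is the alignment of empty suffixes. For the same reason, the boundary values are V[i,0] = V[0,j] = 0.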
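The change from global to local alignment is small in code. The sketch below assumes the naive scoring rule again and returns only the best local score, which is the largest entry anywhere in the matrix; recovering the aligned substrings would additionally need back pointers. The names `sigma` and `local_score` are illustrative.

```python
def sigma(x: str, y: str) -> int:
    """Naive scoring function (an assumption): match +1, mismatch -1, indel -2."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def local_score(s: str, t: str) -> int:
    """Best score over all alignments of a substring of s with a substring of t."""
    n, m = len(s), len(t)
    # Boundary cells stay 0: empty suffixes aligned with each other.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                0,                                # start afresh
                V[i][j] + sigma(s[i], t[j]),      # extend with (s[i], t[j])
                V[i][j + 1] + sigma(s[i], "-"),   # extend with (s[i], -)
                V[i + 1][j] + sigma("-", t[j]),   # extend with (-, t[j])
            )
            best = max(best, V[i + 1][j + 1])     # the optimum may end anywhere
    return best


print(local_score("TTTACGTTT", "GGACGGG"))
```

In this example the only shared letters are the substring ACG, so the best local alignment is those three matches, while a global alignment of the same two strings would be dominated by mismatches and indels.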