1 Sequence Alignment I Lecture #2 Background Readings: Gusfield, chapter 11. Durbin et al., chapter 2. This class has been edited from Nir Friedman’s lecture. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.
2 Sequence Comparison Much of bioinformatics involves sequences: DNA sequences, RNA sequences, and protein sequences. We can think of these sequences as strings of letters. DNA & RNA: alphabet Σ of 4 letters (A, C, T/U, G). Protein: alphabet Σ of 20 letters (A, R, N, D, C, Q, …).
3 Sequence Comparison (cont) Finding similarity between sequences is important for many biological questions. For example: Find similar proteins, which allows us to predict function & structure. Locate similar subsequences in DNA, which allows us to identify (e.g.) regulatory elements. Locate DNA sequences that might overlap, which helps in sequence assembly.
4 Sequence Alignment Input: two sequences over the same alphabet. Output: an alignment of the two sequences. Example: GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA. A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
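5 Alignments
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three “components”: perfect matches, mismatches, and insertions & deletions (indels). Formal definition of alignment: two rows of equal length, obtained from the two sequences by inserting gap symbols “-”, such that no column consists of two gaps.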
6 Choosing Alignments There are many (how many?) possible alignments. For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Which one is better?
7 Scoring Alignments Motivation: Similar (“homologous”) sequences evolved from a common ancestor. In the course of evolution, the sequences changed from the ancestral sequence by random mutations. Replacement: one letter changed to another. Deletion: deletion of a letter. Insertion: insertion of a letter. Scoring of sequence similarity should reflect how many and which operations took place.
8 A Naive Scoring Rule Each position is scored independently, using: Match: +1. Mismatch: -1. Indel: -2. The score of an alignment is the sum of its position scores.
9 Example
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1x5) + (-1x6) + (-2x12) = -25
According to this scoring, the first alignment is better.
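The position-by-position scoring above can be checked mechanically. The following is a minimal sketch, assuming an alignment is given as two gapped strings of equal length; the function name `score_alignment` and its parameters are illustrative and do not come from the lecture.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, indel=-2):
    """Sum the per-column scores of an alignment given as two gapped rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column may not contain two gaps")
        if a == "-" or b == "-":
            total += indel       # insertion or deletion
        elif a == b:
            total += match       # perfect match
        else:
            total += mismatch    # replacement
    return total


first = score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
second = score_alignment("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--")
print(first, second)
```

For the two alignments of the example this gives 3 for the first and -25 for the second (5 matches, 6 mismatches, and 12 indel columns), so the first alignment wins under the naive rule.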
10 More General Scores The choice of +1, -1, and -2 scores is quite arbitrary. Depending on the context, some changes are more plausible than others: exchange of an amino acid by one with similar properties (size, charge, etc.) vs. exchange of an amino acid by one with very different properties. Probabilistic interpretation: (e.g.) How likely is one alignment versus another?
11 Additive Scoring Rules We define a scoring function σ by specifying: σ(x,y) is the score of replacing x by y; σ(x,-) is the score of deleting x; σ(-,x) is the score of inserting x. The score of an alignment is defined as the sum of its position scores.
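Written out as a formula, the additive rule might look as follows. The row notation s', t' (the two gapped rows of the alignment) and the common length L are assumptions introduced here for illustration; σ denotes the scoring function of this slide.

```latex
\mathrm{score}(s', t') \;=\; \sum_{k=1}^{L} \sigma\bigl(s'_k,\, t'_k\bigr)
```

With σ(x,x) = +1, σ(x,y) = -1 for x ≠ y, and σ(x,-) = σ(-,x) = -2, this reduces to the naive scoring rule.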
12 The Optimal Score The optimal (maximal) score between two sequences is the maximal score of all alignments of these sequences, namely,
$$V(s,t) = \max_{\text{alignments } A \text{ of } s \text{ and } t} \mathrm{score}(A)$$
Computing the maximal score and actually finding an alignment that yields the maximal score are closely related tasks with similar algorithms. We now address these problems.
13 Computing Optimal Score How can we compute the optimal score? If |s| = n and |t| = m, the number A(m,n) of possible alignments is large! Exercise: Show that [formula missing from the transcript; a note on the slide, translated from Hebrew, reads “the formula is probably not exact”]. So it is not a good idea to go over all alignments. The additive form of the score allows us to apply dynamic programming to compute the optimal score efficiently.
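To get a feeling for how quickly the number of alignments grows, here is a small sketch. It assumes that two alignments count as different whenever their sequences of columns differ (so A- over -A and -A over A- are distinct); the slide's own A(m,n) may be defined differently. Under that assumption the last column is one of three kinds, which gives the recurrence A(m,n) = A(m-1,n) + A(m,n-1) + A(m-1,n-1) with A(m,0) = A(0,n) = 1. The function name `count_alignments` is illustrative.

```python
def count_alignments(m, n):
    """Count alignments of a length-m string with a length-n string,
    treating alignments as distinct sequences of columns."""
    # A[i][j] = number of alignments of a length-i prefix with a length-j prefix.
    # Row 0 and column 0 stay at 1: only gaps can be used there.
    A = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # last column is (x, -), (-, y), or (x, y)
            A[i][j] = A[i - 1][j] + A[i][j - 1] + A[i - 1][j - 1]
    return A[m][n]


for size in (1, 2, 3, 10, 17):
    print(size, count_alignments(size, size))
```

Even for the two 17-letter sequences of the earlier example the count is far too large to enumerate by hand, which is the motivation for dynamic programming.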
14 Recursive Argument Suppose we have two sequences: s[1..n+1] and t[1..m+1]. The best alignment must be one of three cases: 1. The last column is (s[n+1], t[m+1]). 2. The last column is (s[n+1], -). 3. The last column is (-, t[m+1]).
17 Useful Notation (ab)use of notation: V[i,j] = value of the optimal alignment between the i-prefix of s and the j-prefix of t.
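18 Recursive Argument Using our recursive argument, we get the following recurrence for V, where σ is the scoring function:
$$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1],\, t[j+1]) \\ V[i,j+1] + \sigma(s[i+1],\, -) \\ V[i+1,j] + \sigma(-,\, t[j+1]) \end{cases}$$
Each cell V[i+1,j+1] depends only on its three neighbours V[i,j], V[i,j+1], and V[i+1,j].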
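19 Recursive Argument Of course, we also need to handle the base cases in the recursion (boundary of the matrix), where a prefix is aligned against the empty string (for example, AA versus - -):
$$V[0,0] = 0, \qquad V[i+1,0] = V[i,0] + \sigma(s[i+1],\, -), \qquad V[0,j+1] = V[0,j] + \sigma(-,\, t[j+1])$$
We fill the “interior of the matrix” using our recurrence rule.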
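The recurrence and its boundary conditions can be sketched in a few lines. This is a minimal illustration rather than the lecture's own code: the names `naive_sigma` and `global_score_matrix` are hypothetical, the naive rule (+1, -1, -2) is assumed as the scoring function, and the strings AAAC and AGC are used as sample input. Python strings are 0-based, so `s[i]` here plays the role of s[i+1] in the slides.

```python
def naive_sigma(x, y):
    """Naive scoring rule: match +1, mismatch -1, indel -2."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def global_score_matrix(s, t, sigma=naive_sigma):
    """Return V, where V[i][j] is the optimal score of aligning
    the i-prefix of s with the j-prefix of t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: a prefix aligned against the empty string.
    for i in range(n):
        V[i + 1][0] = V[i][0] + sigma(s[i], "-")
    for j in range(m):
        V[0][j + 1] = V[0][j] + sigma("-", t[j])
    # Interior: best of the three possible last columns.
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                V[i][j] + sigma(s[i], t[j]),     # (s[i+1], t[j+1])
                V[i][j + 1] + sigma(s[i], "-"),  # (s[i+1], -)
                V[i + 1][j] + sigma("-", t[j]),  # (-, t[j+1])
            )
    return V


V = global_score_matrix("AAAC", "AGC")
for row in V:
    print(row)
print("optimal score:", V[4][3])
```

For this input the bottom-right cell holds -1, which is the score of, for instance, AAAC aligned over AG-C.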
20 Dynamic Programming Algorithm We continue to fill the matrix using the recurrence rule.
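21 Dynamic Programming Algorithm Example: V[1,1] is computed from V[0,0], V[0,1], and V[1,0]. Aligning A with A extends V[0,0] by +1; the two alternatives (A- versus -A) each extend a boundary cell of value -2 by a further -2. The maximum is +1.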
22 Dynamic Programming Algorithm (Hey, what is the scoring function σ(x,y)?)
24 Reconstructing the Best Alignment To reconstruct the best alignment, we record which case(s) in the recursive rule maximized the score.
25 Reconstructing the Best Alignment We now trace back a path that corresponds to the best alignment:
AAAC
AG-C
26 Reconstructing the Best Alignment More than one alignment could have the best score (sometimes, even exponentially many):
AAAC
A-GC

AAAC
-AGC

AAAC
AG-C
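The trace-back can be sketched as follows. This is an illustrative version under stated assumptions: it uses the naive rule (+1, -1, -2), and instead of storing back pointers while filling the matrix it re-checks, at each cell, which of the three cases attains the stored maximum, which yields the same paths. The names `naive_sigma`, `fill_matrix`, and `all_optimal_alignments` are hypothetical. Since the number of optimal alignments can be exponential, enumerating all of them is only sensible for small inputs.

```python
def naive_sigma(x, y):
    """Naive scoring rule: match +1, mismatch -1, indel -2."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def fill_matrix(s, t, sigma):
    """Global alignment score matrix for s and t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        V[i + 1][0] = V[i][0] + sigma(s[i], "-")
    for j in range(m):
        V[0][j + 1] = V[0][j] + sigma("-", t[j])
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(V[i][j] + sigma(s[i], t[j]),
                                  V[i][j + 1] + sigma(s[i], "-"),
                                  V[i + 1][j] + sigma("-", t[j]))
    return V


def all_optimal_alignments(s, t, sigma=naive_sigma):
    """Return (optimal score, list of all optimal alignments as row pairs)."""
    V = fill_matrix(s, t, sigma)
    found = []

    def walk(i, j, top, bottom):
        # Walk from cell (i, j) back to (0, 0), following every maximizing case.
        if i == 0 and j == 0:
            found.append((top, bottom))
            return
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            walk(i - 1, j - 1, s[i - 1] + top, t[j - 1] + bottom)
        if i > 0 and V[i][j] == V[i - 1][j] + sigma(s[i - 1], "-"):
            walk(i - 1, j, s[i - 1] + top, "-" + bottom)
        if j > 0 and V[i][j] == V[i][j - 1] + sigma("-", t[j - 1]):
            walk(i, j - 1, "-" + top, t[j - 1] + bottom)

    walk(len(s), len(t), "", "")
    return V[len(s)][len(t)], found


best, alignments = all_optimal_alignments("AAAC", "AGC")
print(best)
for top, bottom in alignments:
    print(top)
    print(bottom)
    print()
```

For AAAC and AGC this finds three alignments of score -1, in which AAAC is paired with A-GC, -AGC, and AG-C.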
27 Time Complexity Space: O(mn). Time: O(mn). Filling the matrix: O(mn). Backtrace: O(m+n).
28 Space Complexity In real-life applications, n and m can be very large. The space requirements of O(mn) can be too demanding. If m = n = 1000, we need 1MB space. If m = n = 10000, we need 100MB space. We can afford to perform extra computation to save space: looping over a million operations takes less than a second on modern workstations. Can we trade space for time?
29 Why Do We Need So Much Space? To compute just the value V[n,m] = V(s[1..n], t[1..m]), we need only O(min(n,m)) space: compute V(i,j) column by column, storing only two columns in memory (or line by line if lines are shorter). (Figure: the score matrix of the running example.) Note however that this “trick” fails if we want to reconstruct the optimal alignment: trace-back information requires keeping all back pointers, which takes O(mn) memory.
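The two-line idea can be sketched as below. This is a minimal illustration with hypothetical names (`naive_sigma`, `global_score_two_rows`), assuming the naive rule (+1, -1, -2). It keeps only the previous and the current line of the matrix, and iterates over the longer sequence so that the stored lines have the length of the shorter one; when the roles of s and t are swapped, the arguments of the scoring function are swapped as well, so that insertions and deletions keep their meaning.

```python
def naive_sigma(x, y):
    """Naive scoring rule: match +1, mismatch -1, indel -2."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def global_score_two_rows(s, t, sigma=naive_sigma):
    """Optimal global alignment score using O(min(len(s), len(t))) memory."""
    if len(t) > len(s):
        # Keep lines as short as possible: swap the sequences and,
        # accordingly, the arguments of the scoring function.
        return global_score_two_rows(t, s, lambda x, y: sigma(y, x))
    # prev holds line i of the matrix; it starts as the boundary line i = 0.
    prev = [0] * (len(t) + 1)
    for j in range(len(t)):
        prev[j + 1] = prev[j] + sigma("-", t[j])
    for i in range(len(s)):
        curr = [prev[0] + sigma(s[i], "-")] + [0] * len(t)
        for j in range(len(t)):
            curr[j + 1] = max(prev[j] + sigma(s[i], t[j]),
                              prev[j + 1] + sigma(s[i], "-"),
                              curr[j] + sigma("-", t[j]))
        prev = curr  # the older line is no longer needed
    return prev[len(t)]


print(global_score_two_rows("AAAC", "AGC"))
print(global_score_two_rows("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA"))
```

Only the score comes out of this version; the lines needed for a trace-back have already been discarded by the time the last cell is computed.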
30 Local Alignment The alignment version we studied so far is called global alignment: we align the whole sequence s to the whole sequence t. Global alignment is appropriate when s and t are highly similar (examples?), but makes little sense if they are highly dissimilar, for example when s (“the query”) is very short but t (“the database”) is very long.
31 Local Alignment When s and t are not necessarily similar, we may want to consider a different question: find similar subsequences of s and t. Formally, given s[1..n] and t[1..m], find i, j, k, and l such that V(s[i..j], t[k..l]) is maximal. This version is called local alignment.
32 Local Alignment As before, we use dynamic programming. We now want to set V[i,j] to record the maximum value over all alignments of a suffix of s[1..i] and a suffix of t[1..j]. In other words, we look for a suffix of a prefix. How should we change the recurrence rule? Same as before, but with an option to start afresh. The result is called the Smith-Waterman algorithm, after its inventors (1981).
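33 Local Alignment New option: we can start a new match instead of extending a previous alignment. With σ the scoring function, the recurrence becomes:
$$V[i+1,j+1] = \max \begin{cases} 0 \\ V[i,j] + \sigma(s[i+1],\, t[j+1]) \\ V[i,j+1] + \sigma(s[i+1],\, -) \\ V[i+1,j] + \sigma(-,\, t[j+1]) \end{cases}$$
The value 0 corresponds to the alignment of empty suffixes.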
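A compact sketch of the local variant follows. The names `naive_sigma` and `local_alignment_score` are hypothetical, and the naive rule (+1, -1, -2) is assumed. The boundary cells are set to 0, which is correct as long as indel scores are not positive (an assumption of this sketch). The best local score is the largest entry anywhere in the matrix, not necessarily the bottom-right cell.

```python
def naive_sigma(x, y):
    """Naive scoring rule: match +1, mismatch -1, indel -2."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def local_alignment_score(s, t, sigma=naive_sigma):
    """Return (best local score, (i, j)) where the best local alignment
    ends at s[1..i] and t[1..j] in the slides' 1-based notation."""
    n, m = len(s), len(t)
    # Boundary cells stay 0: the empty suffix aligned with the empty suffix.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                0,                               # start afresh
                V[i][j] + sigma(s[i], t[j]),     # extend with (s[i+1], t[j+1])
                V[i][j + 1] + sigma(s[i], "-"),  # extend with (s[i+1], -)
                V[i + 1][j] + sigma("-", t[j]),  # extend with (-, t[j+1])
            )
            if V[i + 1][j + 1] > best:
                best, best_cell = V[i + 1][j + 1], (i + 1, j + 1)
    return best, best_cell


print(local_alignment_score("AAAAGCGC", "GCGCTTTT"))
```

In this sample input the only similar region is GCGC, at the end of the first string and the start of the second, so the best local score is 4 and it is reached at cell (8, 4). A trace-back from that cell, stopping at the first 0, would recover the aligned substrings.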