Sequence Alignment I, Lecture #2. This class has been edited from Nir Friedman’s lecture, which is available at www.cs.huji.ac.il/~nir.

1 Sequence Alignment I, Lecture #2
This class has been edited from Nir Friedman’s lecture, which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.
Background readings: Gusfield, chapter 11; Durbin et al., chapter 2.

2 Sequence Comparison
Much of bioinformatics involves sequences:
- DNA sequences
- RNA sequences
- Protein sequences
We can think of these sequences as strings of letters:
- DNA and RNA: alphabet Σ of 4 letters (A, C, T/U, G)
- Protein: alphabet Σ of 20 letters (A, R, N, D, C, Q, …)

3 Sequence Comparison (cont.)
Finding similarity between sequences is important for many biological questions. For example:
- Find similar proteins: this allows us to predict function and structure.
- Locate similar subsequences in DNA: this allows us to identify, e.g., regulatory elements.
- Locate DNA sequences that might overlap: this helps in sequence assembly.

4 Sequence Alignment
Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
    GCGCATGGATTGAGCGA
    TGCGCCATTGATGACCA
A possible alignment:
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A

5 Alignments
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
Three “components”:
- Perfect matches
- Mismatches
- Insertions and deletions (indels)
Formal definition of alignment:

6 Choosing Alignments
There are many (how many?) possible alignments. For example, compare
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
to
    ------GCGCATGGATTGAGCGA
    TGCGCC----ATTGATGACCA--
Which one is better?

7 Scoring Alignments
Motivation:
- Similar (“homologous”) sequences evolved from a common ancestor.
- In the course of evolution, the sequences changed from the ancestral sequence by random mutations:
  - Replacement: one letter changed to another
  - Deletion: deletion of a letter
  - Insertion: insertion of a letter
- Scoring of sequence similarity should reflect how many and which operations took place.

8 A Naive Scoring Rule
Each position is scored independently, using:
- Match: +1
- Mismatch: -1
- Indel: -2
The score of an alignment is the sum of its position scores.

9 Example
    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A
Score: (+1 × 13) + (-1 × 2) + (-2 × 4) = 3
    ------GCGCATGGATTGAGCGA
    TGCGCC----ATTGATGACCA--
Score: (+1 × 5) + (-1 × 6) + (-2 × 12) = -25
According to this scoring, the first alignment is better.
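The two scores can be checked mechanically. The following is a minimal sketch of the naive rule; the function name `naive_score` is hypothetical, and the first row of the second alignment is assumed to start with six gap characters (the transcript drops them), which gives both rows 23 columns: 5 matches, 6 mismatches, and 12 indels.

```python
def naive_score(row_s: str, row_t: str,
                match: int = 1, mismatch: int = -1, indel: int = -2) -> int:
    """Score a gapped alignment column by column under the naive rule."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("a column may not contain two gaps")
        if a == "-" or b == "-":
            total += indel        # insertion or deletion
        elif a == b:
            total += match        # perfect match
        else:
            total += mismatch     # replacement
    return total


first = naive_score("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
# Assumption: six leading gaps restored in the first row of the second alignment.
second = naive_score("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--")
print(first, second)
```

Under these assumptions the first alignment scores higher than the second, in line with the slide's conclusion.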

10 More General Scores
- The choice of the +1, -1, and -2 scores is quite arbitrary.
- Depending on the context, some changes are more plausible than others:
  - exchange of an amino acid by one with similar properties (size, charge, etc.), versus
  - exchange of an amino acid by one with very different properties.
- Probabilistic interpretation: e.g., how likely is one alignment versus another?

11 Additive Scoring Rules
- We define a scoring function σ by specifying:
  - σ(x,y) is the score of replacing x by y
  - σ(x,-) is the score of deleting x
  - σ(-,x) is the score of inserting x
- The score of an alignment is defined as the sum of its position scores.
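In symbols, one way to write this down (a sketch under assumed notation: s' and t' denote the two rows of the alignment, i.e. the sequences padded with gap symbols to a common length ℓ; these names do not appear in the slides):

```latex
\mathrm{score}(s', t') \;=\; \sum_{k=1}^{\ell} \sigma\bigl(s'_k,\, t'_k\bigr)
```

The naive rule is the special case σ(x,x) = +1, σ(x,y) = -1 for x ≠ y, and σ(x,-) = σ(-,x) = -2.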

12 The Optimal Score
- The optimal (maximal) score between two sequences is the maximal score over all alignments of these sequences, namely,
  $$V(s,t) = \max_{\text{alignments } a \text{ of } s \text{ and } t} \mathrm{score}(a).$$
- Computing the maximal score and actually finding an alignment that yields it are closely related tasks with similar algorithms.
- We now address these problems.

13 Computing Optimal Score
- How can we compute the optimal score?
  - If |s| = n and |t| = m, the number A(m,n) of possible alignments is large!
  - Exercise: Show that
- So it is not a good idea to go over all alignments.
- The additive form of the score allows us to apply dynamic programming to compute the optimal score efficiently.
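To get a feel for how quickly A(m,n) grows, here is a small counting sketch. It assumes an alignment is a sequence of columns, each being a letter pair, a letter over a gap, or a gap over a letter (never two gaps), and that alignments differing only in the order of adjacent indels count as different. The function name `count_alignments` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Number of alignments of two sequences of lengths m and n."""
    if m == 0 or n == 0:
        # One sequence is exhausted: every remaining letter faces a gap.
        return 1
    # The last column is a letter pair, a deletion, or an insertion.
    return (count_alignments(m - 1, n - 1)
            + count_alignments(m - 1, n)
            + count_alignments(m, n - 1))


for k in (1, 2, 3, 5, 10):
    print(k, count_alignments(k, k))
```

Already for two sequences of length 10 the count exceeds eight million, which illustrates why enumerating all alignments is impractical.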

14 Recursive Argument
Suppose we have two sequences, s[1..n+1] and t[1..m+1]. The best alignment must fall into one of three cases:
1. The last position is (s[n+1], t[m+1])
2. The last position is (s[n+1], -)
3. The last position is (-, t[m+1])

17 Useful Notation
(Ab)use of notation: V[i,j] = value of an optimal alignment between the i-prefix of s and the j-prefix of t, i.e., V[i,j] = V(s[1..i], t[1..j]).

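18 Recursive Argument
(Ab)use of notation: V[i,j] = value of an optimal alignment between the i-prefix of s and the j-prefix of t.
Using our recursive argument, we get the following recurrence for V, where σ is the scoring function:
$$V[i+1,j+1] = \max \begin{cases} V[i,j] + \sigma(s[i+1],\, t[j+1]) \\ V[i,j+1] + \sigma(s[i+1],\, -) \\ V[i+1,j] + \sigma(-,\, t[j+1]) \end{cases}$$
[Figure: the cell V[i+1,j+1] and its three neighbors V[i,j], V[i+1,j], and V[i,j+1].]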

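19 Recursive Argument
Of course, we also need to handle the base cases in the recursion (the boundary of the matrix):
$$V[0,0] = 0, \qquad V[i+1,0] = V[i,0] + \sigma(s[i+1],\, -), \qquad V[0,j+1] = V[0,j] + \sigma(-,\, t[j+1]).$$
We fill the “interior of the matrix” using our recurrence rule.
[Figure: the alignment matrix for S and T, with the boundary cells filled in.]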
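The three cases and the boundary handling can be written almost literally as a memoized recursion. This is a sketch, not the lecture's code: the names `sigma` and `optimal_score` are hypothetical, and the naive scoring rule (+1, -1, -2) is assumed. Python strings are 0-indexed, so `s[i - 1]` plays the role of s[i].

```python
from functools import lru_cache


def sigma(x: str, y: str) -> int:
    """Assumed naive scoring function; '-' denotes a gap."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def optimal_score(s: str, t: str) -> int:
    """Value of an optimal global alignment of s and t, computed top-down."""

    @lru_cache(maxsize=None)
    def V(i: int, j: int) -> int:
        if i == 0 and j == 0:
            return 0
        if j == 0:  # boundary: s[1..i] aligned against gaps only
            return V(i - 1, 0) + sigma(s[i - 1], "-")
        if i == 0:  # boundary: gaps only aligned against t[1..j]
            return V(0, j - 1) + sigma("-", t[j - 1])
        return max(
            V(i - 1, j - 1) + sigma(s[i - 1], t[j - 1]),  # last position is a pair
            V(i - 1, j) + sigma(s[i - 1], "-"),           # last position deletes s[i]
            V(i, j - 1) + sigma("-", t[j - 1]),           # last position inserts t[j]
        )

    return V(len(s), len(t))


print(optimal_score("AAAC", "AGC"))
```

Memoization ensures each pair (i, j) is evaluated once, so the work is proportional to the number of matrix cells. For long sequences the recursion depth becomes a practical limit, which is one reason to fill the matrix iteratively instead.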

20 Dynamic Programming Algorithm
We continue to fill the matrix using the recurrence rule.
[Figure: the alignment matrix for S and T.]

21 Dynamic Programming Algorithm
[Figure: the cells V[0,0], V[0,1], V[1,0], and V[1,1] of the matrix for S and T. Labels: A aligned with A, versus A- aligned with -A; -2.]

22 Dynamic Programming Algorithm
(Hey, what is the scoring function σ(x,y)?)
[Figure: the alignment matrix for S and T.]

23 Dynamic Programming Algorithm
Conclusion: V(AAAC, AGC) = -1
[Figure: the alignment matrix for S = AAAC and T = AGC.]
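The matrix for this example can be reproduced with a short bottom-up sketch. It assumes the naive scoring rule (+1, -1, -2) and places S along the rows and T along the columns; the function name `fill_matrix` is hypothetical.

```python
def fill_matrix(s: str, t: str,
                match: int = 1, mismatch: int = -1, indel: int = -2):
    """Return the full table V, where V[i][j] scores s[:i] against t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: a prefix aligned against gaps only.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + indel
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + indel
    # Interior: best of the three possible last positions.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = V[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            V[i][j] = max(diag, V[i - 1][j] + indel, V[i][j - 1] + indel)
    return V


V = fill_matrix("AAAC", "AGC")
for row in V:
    print(row)
```

The bottom-right entry `V[4][3]` is the optimal score of the two full sequences.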

24 Reconstructing the Best Alignment
To reconstruct the best alignment, we record which case(s) in the recursive rule maximized the score.
[Figure: the alignment matrix for S and T.]

25 Reconstructing the Best Alignment
We now trace back a path that corresponds to the best alignment:
    AAAC
    AG-C
[Figure: the traceback path in the matrix for S and T.]

26 Reconstructing the Best Alignment
More than one alignment could have the best score (sometimes, even exponentially many):
    AAAC        AAAC        AAAC
    A-GC        -AGC        AG-C
[Figure: the traceback paths in the matrix for S and T.]
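A traceback that enumerates every optimal alignment can be sketched as follows. Instead of storing back pointers, this version re-checks which cases of the recurrence attain the maximum in each cell, which is equivalent. The naive scoring rule is assumed and the function name `all_optimal_alignments` is hypothetical. Since the number of optimal alignments can be exponential, full enumeration is only sensible for small inputs.

```python
def all_optimal_alignments(s: str, t: str,
                           match: int = 1, mismatch: int = -1, indel: int = -2):
    """Return (optimal score, sorted list of all optimal (row_s, row_t) pairs)."""
    n, m = len(s), len(t)

    def sub(i: int, j: int) -> int:
        return match if s[i - 1] == t[j - 1] else mismatch

    # Fill the matrix exactly as in the global alignment recurrence.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * indel
    for j in range(1, m + 1):
        V[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + sub(i, j),
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)

    def walk(i: int, j: int):
        """Yield every optimal alignment of s[:i] and t[:j]."""
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sub(i, j):
            for a, b in walk(i - 1, j - 1):
                yield a + s[i - 1], b + t[j - 1]
        if i > 0 and V[i][j] == V[i - 1][j] + indel:
            for a, b in walk(i - 1, j):
                yield a + s[i - 1], b + "-"
        if j > 0 and V[i][j] == V[i][j - 1] + indel:
            for a, b in walk(i, j - 1):
                yield a + "-", b + t[j - 1]

    return V[n][m], sorted(walk(n, m))


score, alignments = all_optimal_alignments("AAAC", "AGC")
print(score)
for row_s, row_t in alignments:
    print(row_s)
    print(row_t)
    print()
```

Following a single path back from the bottom-right cell takes O(m+n) steps; only the enumeration of all paths can be expensive.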

27 Time Complexity
Space: O(mn)
Time: O(mn)
- Filling the matrix: O(mn)
- Backtrace: O(m+n)

28 Space Complexity
- In real-life applications, n and m can be very large.
- The space requirement of O(mn) can be too demanding:
  - If m = n = 1000, we need 1 MB of space.
  - If m = n = 10000, we need 100 MB of space.
- We can afford to perform extra computation to save space:
  - Looping over a million operations takes less than a second on modern workstations.
- Can we spend more time to save space?

29 Why Do We Need So Much Space?
To compute just the value V[n,m] = V(s[1..n], t[1..m]), we need only O(min(n,m)) space:
- Compute V[i,j] column by column, storing only two columns in memory (or row by row if rows are shorter).
[Figure: the matrix for S = AAAC (indices 0 to 4) and T = AGC (indices 1 to 3).]
Note, however, that:
- This “trick” fails if we want to reconstruct the optimal alignment.
- Traceback information requires keeping all back pointers, which takes O(mn) memory.
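A sketch of the score-only computation that keeps two rows at a time is shown below. It assumes the naive scoring rule; because that rule is symmetric in the two sequences, the inputs can be swapped so that the stored rows follow the shorter one. The function name `optimal_score_linear_space` is hypothetical.

```python
def optimal_score_linear_space(s: str, t: str,
                               match: int = 1, mismatch: int = -1,
                               indel: int = -2) -> int:
    """Optimal global score using O(min(len(s), len(t))) memory."""
    if len(t) > len(s):
        # Valid only because the assumed scoring rule is symmetric.
        s, t = t, s
    # prev holds row i-1 of the matrix; it starts as the boundary row V[0][*].
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel] + [0] * len(t)  # curr[0] is the boundary cell V[i][0]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + indel, curr[j - 1] + indel)
        prev = curr  # row i-1 is no longer needed
    return prev[len(t)]


print(optimal_score_linear_space("AAAC", "AGC"))
```

Only the score is returned; the discarded rows are exactly the information a traceback would need.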

30 Local Alignment
The alignment version we have studied so far is called global alignment: we align the whole sequence s to the whole sequence t.
Global alignment is appropriate when s and t are highly similar (examples?), but makes little sense if they are highly dissimilar, for example when s (“the query”) is very short but t (“the database”) is very long.

31 Local Alignment
When s and t are not necessarily similar, we may want to consider a different question:
- Find similar subsequences of s and t.
- Formally, given s[1..n] and t[1..m], find i, j, k, and l such that V(s[i..j], t[k..l]) is maximal.
- This version is called local alignment.

32 Local Alignment
- As before, we use dynamic programming.
- We now want to set V[i,j] to record the maximum value over all alignments of a suffix of s[1..i] and a suffix of t[1..j].
- In other words, we look for a suffix of a prefix.
- How should we change the recurrence rule?
  - Same as before, but with an option to start afresh.
- The result is called the Smith-Waterman algorithm, after its inventors (1981).

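33 Local Alignment
New option:
- We can start a new match instead of extending a previous alignment.
$$V[i+1,j+1] = \max \begin{cases} 0 & \text{(alignment of empty suffixes)} \\ V[i,j] + \sigma(s[i+1],\, t[j+1]) \\ V[i,j+1] + \sigma(s[i+1],\, -) \\ V[i+1,j] + \sigma(-,\, t[j+1]) \end{cases}$$
with V[i,0] = V[0,j] = 0 on the boundary.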

34 Local Alignment Example
s = TAATA
t = TACTAA
[Figure: the local alignment matrix for S and T.]
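The local variant differs from the global one only in the extra 0 option and the zero boundary. Below is a sketch for this example; the transcript does not show which scoring function the slide's matrix uses, so the naive rule (+1, -1, -2) is an assumption, and the function name `local_alignment_score` is hypothetical.

```python
def local_alignment_score(s: str, t: str,
                          match: int = 1, mismatch: int = -1, indel: int = -2):
    """Return (best local score, (i, j) of the first cell attaining it)."""
    n, m = len(s), len(t)
    # Boundary cells stay 0: an empty suffix aligned with an empty suffix.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = V[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # The extra 0 lets a new local alignment start afresh at (i, j).
            V[i][j] = max(0, diag, V[i - 1][j] + indel, V[i][j - 1] + indel)
            if V[i][j] > best:
                best, best_end = V[i][j], (i, j)
    return best, best_end


score, end = local_alignment_score("TAATA", "TACTAA")
print(score, end)
```

The best local score is the maximum over the whole matrix rather than the bottom-right entry; tracing back from the maximizing cell until a 0 is reached recovers the aligned substrings.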

