6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.


1 Computational Biology
Lecture #7: Local Alignment
Bud Mishra, Professor of Computer Science and Mathematics
10 ¦ 22 ¦ 2002

2 Local Alignment Problem (LAP)
Finding substrings of high similarity: two given strings, $S_1$ and $S_2$, may have regions that are locally highly similar.

3 LAP: Local Alignment Problem
Given: Two strings $S_1$ and $S_2$.
Find: Substrings $\alpha \sqsubseteq S_1$ and $\beta \sqsubseteq S_2$ whose similarity (in terms of an objective function, e.g., the optimal global alignment value) is maximum over all pairs of substrings from $S_1$ and $S_2$:
$v^* = \max_{\alpha \sqsubseteq S_1,\ \beta \sqsubseteq S_2} \mathrm{distance}(\alpha, \beta)$

4 Example
Scoring: $d(x,x) = 2$, $d(x,y) = -2$, $d(x,-) = d(-,x) = -1$.
$S_1$ = p q r a x a b c s t v q
$S_2$ = x y a x b a c s l l
$\alpha$ = a x a b c s $\sqsubseteq S_1$
$\beta$ = a x b a c s $\sqsubseteq S_2$
Local alignment:
 a  x  a  b  -  c  s
 |  |     |     |  |
 a  x  -  b  a  c  s
 2  2 -1  2 -1  2  2
$\mathrm{distance}(\alpha, \beta) = 8$

5 Naïve Complexity
Note:
(1) Let $|S_1| = n$ and $|S_2| = m$.
– Total number of substrings of $S_1$ = $\binom{n+1}{2} = O(n^2)$
– Total number of substrings of $S_2$ = $\binom{m+1}{2} = O(m^2)$
– Naïvely, $O(n^2 m^2)$ candidate pairs of substrings $(\alpha, \beta)$ need to be globally aligned, each by a DP algorithm of complexity $O(|\alpha|\,|\beta|)$.
Complexity of the resulting algorithm = $O(n^3 m^3)$
(2) An improved algorithm (SWAT, Smith-Waterman) reduces the time complexity to $O(nm)$.

6 LSAP: Local Suffix Alignment Problem
A restricted version of the LAP.
Given: Two strings $S_1$ and $S_2$ and two indices $i \le |S_1|$ and $j \le |S_2|$
– $A_i = S_1[1..i]$, a prefix of $S_1$
– $B_j = S_2[1..j]$, a prefix of $S_2$
Find: A suffix $\alpha = S_1[k..i]$ of $A_i$ (possibly empty, $\epsilon$) and a suffix $\beta = S_2[l..j]$ of $B_j$ (possibly empty, $\epsilon$) that maximize a linear objective function $V(\alpha, \beta)$ over all pairs of suffixes of $A_i$ and $B_j$.

7 Objective Function
$v(i,j) = \max_{\alpha = \mathrm{suf}\, S_1[1..i],\ \beta = \mathrm{suf}\, S_2[1..j]} V(\alpha, \beta)$
= value of the optimal local suffix alignment for the given index pair $i, j$.
$v^* = \max_{i \le n,\ j \le m} v(i,j)$
= value of the optimal local alignment, where $n = |S_1|$, $m = |S_2|$.

8 Optimal Local Alignment: Recurrence Equations
$v^* = \max_{i \le n,\ j \le m} v(i,j)$
$v^* = v(i', j') = V(\alpha, \beta)$ for some index pair $(i', j')$ and suffixes $\alpha$, $\beta$.
Consider an optimal suffix alignment with $\alpha = \mathrm{suf}\, S_1[1..i]$ and $\beta = \mathrm{suf}\, S_2[1..j]$.
Case 1: $\alpha = \beta = \epsilon$ (the empty string)
– Base: $V(\epsilon, \epsilon) = 0$

9 Optimal Local Alignment: Recurrence Equations
Case 2: $\alpha \ne \epsilon$, $\alpha = \alpha' \cdot S_1[i]$, and $S_1[i]$ matches "-"
– Ind(A): $V(\alpha, \beta) = V(\alpha', \beta) + d(S_1[i], -)$
…or $S_1[i]$ matches $S_2[j]$ ($\beta = \beta' \cdot S_2[j]$)
– Ind(C): $V(\alpha, \beta) = V(\alpha', \beta') + d(S_1[i], S_2[j])$

10 Optimal Local Alignment: Recurrence Equations
Case 3: $\beta \ne \epsilon$, $\beta = \beta' \cdot S_2[j]$, and $S_2[j]$ matches "-"
– Ind(B): $V(\alpha, \beta) = V(\alpha, \beta') + d(-, S_2[j])$
…or $S_1[i]$ matches $S_2[j]$ ($\alpha = \alpha' \cdot S_1[i]$)
– Ind(C): $V(\alpha, \beta) = V(\alpha', \beta') + d(S_1[i], S_2[j])$

11 Recurrence Equation
$v(i,j) = \max_{\alpha = \mathrm{suf}\, S_1[1..i],\ \beta = \mathrm{suf}\, S_2[1..j]} V(\alpha, \beta)$
Base ($i = 0 \lor j = 0$): $v(i,j) = 0$, that is, $v(0,0) = v(i,0) = v(0,j) = 0$
Induction ($i \ne 0 \land j \ne 0$):
$v(i,j) = \max[\,0,\ v(i-1,j) + d(S_1[i], -),\ v(i,j-1) + d(-, S_2[j]),\ v(i-1,j-1) + d(S_1[i], S_2[j])\,]$

12 Dynamic Programming Table (with Traceback)
– Compute all $v(i,j)$ entries: Complexity = $O(nm)$
– Find $v^* = v(i^*, j^*)$ by finding the largest value in any cell: Complexity = $O(nm)$
– Trace the pointers back from $v(i^*, j^*)$ until a cell $(i', j')$ is reached with value $v(i', j') = 0$: Complexity = $O(n+m)$
Results: $\alpha = S_1[i'+1..i^*] \sqsubseteq S_1$ and $\beta = S_2[j'+1..j^*] \sqsubseteq S_2$ (the zero-valued cell $(i', j')$ itself lies just before the aligned region)
Total Complexity = $O(nm) = O(|S_1|\,|S_2|)$
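The table fill and traceback described above can be sketched in Python. This is a minimal illustration, not the lecture's own code: the function name `smith_waterman` and its parameter names are made up here, the default scores are the ones from the earlier example ($d(x,x)=2$, $d(x,y)=-2$, $d(x,-)=d(-,x)=-1$), and ties in the traceback are broken in the arbitrary order diagonal, up, left.

```python
def smith_waterman(s1, s2, match=2, mismatch=-2, space=-1):
    """Local alignment with a linear space penalty (illustrative sketch)."""
    n, m = len(s1), len(s2)
    # v[i][j] = value of the best local suffix alignment of s1[:i] and s2[:j];
    # row 0 and column 0 are the base case and stay 0.
    v = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            v[i][j] = max(0,
                          v[i - 1][j] + space,      # s1[i] opposite a space
                          v[i][j - 1] + space,      # s2[j] opposite a space
                          v[i - 1][j - 1] + d)      # s1[i] opposite s2[j]
            if v[i][j] > best:
                best, bi, bj = v[i][j], i, j

    # Traceback from the best cell until a zero-valued cell is reached.
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and v[i][j] > 0:
        d = match if s1[i - 1] == s2[j - 1] else mismatch
        if v[i][j] == v[i - 1][j - 1] + d:
            top.append(s1[i - 1]); bottom.append(s2[j - 1])
            i -= 1; j -= 1
        elif v[i][j] == v[i - 1][j] + space:
            top.append(s1[i - 1]); bottom.append('-')
            i -= 1
        else:
            top.append('-'); bottom.append(s2[j - 1])
            j -= 1
    return best, ''.join(reversed(top)), ''.join(reversed(bottom))


score, row1, row2 = smith_waterman("pqraxabcstvq", "xyaxbacsll")
print(score)   # 8, the value of the example alignment
print(row1)
print(row2)
```

Because several alignments can share the optimal value, the two printed rows may place the spaces differently from the alignment shown in the example, while covering the same substrings.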

13 Example
DP table for $S_1$ = p q r a x a b c s t v q (rows) and $S_2$ = x y a x b a c s l l (columns), with $d(x,x) = 2$, $d(x,y) = -2$, $d(x,-) = d(-,x) = -1$. Arrows are traceback pointers: ↖ diagonal, ↑ up, ← left. Row and column "." correspond to the empty prefix.

       .    x    y    a    x    b    a    c    s    l    l
  .    0    0    0    0    0    0    0    0    0    0    0
  p    0    0    0    0    0    0    0    0    0    0    0
  q    0    0    0    0    0    0    0    0    0    0    0
  r    0    0    0    0    0    0    0    0    0    0    0
  a    0    0    0   ↖2   ←1    0   ↖2   ←1    0    0    0
  x    0   ↖2   ←1   ↑1   ↖4   ←3   ←2   ←1    0    0    0
  a    0   ↑1    0   ↖3   ↑3   ↑2   ↖5   ←4   ←3   ←2   ←1
  b    0    0    0   ↑2   ↑2   ↖5   ←4   ←3   ←2   ←1   ←0
  c    0    0    0   ↑1   ↑1   ↑4   ↑3   ↖6   ←5   ←4   ←3
  s    0    0    0    0    0   ↑3   ↑2   ↑5   ↖8   ←7   ←6
  t    0    0    0    0    0   ↑2   ↑1   ↑4   ↑7   ←6   ←5
  v    0    0    0    0    0   ↑1    0   ↑3   ↑6   ←5   ←4
  q    0    0    0    0    0    0    0   ↑2   ↑5   ←4   ←3

The largest entry, $v^* = 8$, is in row s, column s.

14 Dealing with Gaps
A gap is any "maximal consecutive run of spaces" in a single string of a given alignment.
 c t t t a a c - - a - a c
 c - - - c a c c c a t - c
This alignment contains four gaps, labeled $g_1$, $g_2$, $g_3$, $g_4$.

15 Gaps
Initial Gap
– A gap may be bordered on the right by the first character of a string.
Final Gap
– A gap may be bordered on the left by the last character of a string.
Internal Gap
– A gap may be bordered on both left and right.
Simple Gap Penalty Model ⇒ Constant Weight, $W_g$
– Each gap contributes a constant penalty = $W_g$
– $d(x,x) = 2$, $d(x,y) = -2$, $d(x,-) = d(-,y) = 0$
– # gaps = $k$. Then
– Value of an alignment = $\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - k\,W_g$

16 Biological Motivations for Gap Models
– Unequal crossing-over in meiosis
– DNA slippage during replication
– Insertion of transposable elements ("jumping genes")
– Insertion by retroviruses
– Translocation between chromosomes
Examples of alignment with gaps:
– cDNA matching problem
– Processed pseudo-gene problem

17 Gap Weights
Constant:
– Each gap has a penalty of $W_g$
– Each space is free: $d(x,-) = d(-,x) = 0$.
Affine:
– Gap initiation weight = $W_g$
– Gap extension weight = $W_s$
– Each gap of length $q$ has a penalty of $W_g + q\,W_s$
Convex:
– Each gap of length $q$ has a penalty of $W_g + W_s \ln q$
Arbitrary:
– Each gap of length $q$ has a penalty of $W_g + \omega(q)\,W_s$, where $\omega(q)$ = arbitrary function

18 General Model
Arbitrary:
– Each gap of length $q$ has a penalty of $W_g + \omega(q)\,W_s$, where $\omega(q)$ = arbitrary function
– $\omega(q) = 0$ ⇒ constant
– $\omega(q) = q$ ⇒ linear/affine
– $\omega(q) = \ln q$ ⇒ convex
Total cost under the constant model:
$\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - (\#\mathrm{gaps})\,W_g$
Total cost under the affine model:
$\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - (\#\mathrm{gaps})\,W_g - (\#\mathrm{spaces})\,W_s$
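As a small worked illustration of the two total-cost formulas, the following Python sketch counts gaps and spaces in a given alignment and evaluates it. The helper names `gap_stats` and `alignment_value` are invented for this example; spaces are scored as $d(x,-) = d(-,x) = 0$, and the sample input is the four-gap alignment from the "Dealing with Gaps" slide.

```python
def gap_stats(row1, row2):
    """Return (#gaps, #spaces), where a gap is a maximal run of '-' in one row."""
    gaps = spaces = 0
    for row in (row1, row2):
        prev = ''
        for ch in row:
            if ch == '-':
                spaces += 1
                if prev != '-':      # a new run of spaces starts here
                    gaps += 1
            prev = ch
    return gaps, spaces


def alignment_value(row1, row2, wg, ws=0, match=2, mismatch=-2):
    """Value under the affine model; ws=0 gives the constant model."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row1, row2):
        if x != '-' and y != '-':    # columns containing a space contribute 0
            total += match if x == y else mismatch
    gaps, spaces = gap_stats(row1, row2)
    return total - gaps * wg - spaces * ws


a1 = "ctttaac--a-ac"
a2 = "c---cacccat-c"
print(gap_stats(a1, a2))                  # (4, 7): four gaps, seven spaces
print(alignment_value(a1, a2, wg=3))      # constant model: 8 - 4*3 = -4
print(alignment_value(a1, a2, wg=3, ws=1))  # affine model: 8 - 4*3 - 7*1 = -11
```

The weights `wg=3` and `ws=1` are arbitrary values chosen only to make the arithmetic visible.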

19 Local Alignment under Arbitrary Gap Weight Model
Dynamic Programming (Needleman & Wunsch)
Given two strings $S_1$ and $S_2$, start by aligning the prefixes
– $S_{1,i} = S_1[1..i]$ and
– $S_{2,j} = S_2[1..j]$
There are three different cases to consider…

20 Case 1
$S_1[i]$ is aligned to a character strictly to the left of the character $S_2[j]$.
(Figure: prefixes $S_{1,i}$ and $S_{2,j}$ with the positions of $S_1[i]$ and $S_2[j]$ marked.)

21 Case 2
$S_1[i]$ is aligned to a character strictly to the right of the character $S_2[j]$.
(Figure: prefixes $S_{1,i}$ and $S_{2,j}$ with the positions of $S_1[i]$ and $S_2[j]$ marked.)

22 Case 3
$S_1[i]$ and $S_2[j]$ are aligned opposite each other:
– Subcase A: $S_1[i] = S_2[j]$
– Subcase B: $S_1[i] \ne S_2[j]$
(Figure: prefixes $S_{1,i}$ and $S_{2,j}$ with the positions of $S_1[i]$ and $S_2[j]$ marked.)

23 Auxiliary Variables
$X_L(i,j) = \max_{\text{alignments for case 1}} \mathrm{distance}(S_1[1..i], S_2[1..j])$
$X_R(i,j) = \max_{\text{alignments for case 2}} \mathrm{distance}(S_1[1..i], S_2[1..j])$
$X_S(i,j) = \max_{\text{alignments for case 3}} \mathrm{distance}(S_1[1..i], S_2[1..j])$
$V(i,j) = \max(X_L(i,j),\ X_R(i,j),\ X_S(i,j))$

24 Recurrence: Base
Notation: "?" means "undefined"; $\omega(q)$ is the penalty of a gap of length $q$.
$X_S(0,0) = 0$, $X_S(i,0) = ?$, $X_S(0,j) = ?$
$X_L(0,0) = ?$, $X_L(i,0) = -\omega(i)$, $X_L(0,j) = ?$
$X_R(0,0) = ?$, $X_R(i,0) = ?$, $X_R(0,j) = -\omega(j)$
$V(0,0) = 0$, $V(i,0) = -\omega(i)$, $V(0,j) = -\omega(j)$

25 Recurrence: Induction
For $i > 0$ and $j > 0$:
$X_S(i,j) = V(i-1,j-1) + d(S_1[i], S_2[j])$
$X_L(i,j) = \max_{0 \le k \le j-1} \left( V(i,k) - \omega(j-k) \right)$
$X_R(i,j) = \max_{0 \le l \le i-1} \left( V(l,j) - \omega(i-l) \right)$
$V(i,j) = \max(X_L(i,j),\ X_R(i,j),\ X_S(i,j))$
Each $V(i,j)$ can be computed in time $O(i+j)$.

26 Total Time Complexity
Let $|S_1| = n$ and $|S_2| = m$. The recurrence can be evaluated with a dynamic programming table of space complexity $O(nm)$ and time complexity $O(n^2 m + m^2 n)$.
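The base and induction steps for the arbitrary gap weight model can be written down almost literally. The Python sketch below is only an illustration under stated assumptions: the function name `align_arbitrary_gaps` is hypothetical, the argument `omega` is taken to be the whole penalty of a gap of length $q$ (as it is used in the recurrence), and the match/mismatch scores default to 2 and -2 as in the earlier examples. The two inner `max` scans over `k` and `l` are what give the $O(i+j)$ cost per cell.

```python
def align_arbitrary_gaps(s1, s2, omega, match=2, mismatch=-2):
    """Global alignment value with an arbitrary gap penalty omega(q)."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # Base: a prefix aligned against nothing is a single gap.
    for i in range(1, n + 1):
        V[i][0] = -omega(i)
    for j in range(1, m + 1):
        V[0][j] = -omega(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            x_s = V[i - 1][j - 1] + d                                # case 3
            x_l = max(V[i][k] - omega(j - k) for k in range(j))      # case 1
            x_r = max(V[l][j] - omega(i - l) for l in range(i))      # case 2
            V[i][j] = max(x_l, x_r, x_s)
    return V[n][m]


# omega(q) = q: every space costs 1, so "ab" vs "a" is one match and one space.
print(align_arbitrary_gaps("ab", "a", lambda q: q))      # 1
# Constant model: every gap costs 3 regardless of length.
print(align_arbitrary_gaps("aaaa", "a", lambda q: 3))    # -1
```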

27 Affine Gap Model: Recurrence
SWAT: Smith-Waterman
Modifying the recurrence equations for the affine case:
– $X_S(0,0) = 0$, $X_S(i,0) = ?$, $X_S(0,j) = ?$
– $X_L(0,0) = ?$, $X_L(i,0) = -W_g - i\,W_s$, $X_L(0,j) = ?$
– $X_R(0,0) = ?$, $X_R(i,0) = ?$, $X_R(0,j) = -W_g - j\,W_s$
– $V(0,0) = 0$, $V(i,0) = -W_g - i\,W_s$, $V(0,j) = -W_g - j\,W_s$

28 Recurrence: Induction
For $i > 0$ and $j > 0$:
$X_S(i,j) = V(i-1,j-1) + d(S_1[i], S_2[j])$
$X_L(i,j) = \max(X_L(i,j-1) - W_s,\ ?,\ X_S(i,j-1) - W_g - W_s,\ V(i,j-1) - W_g - W_s)$
$\phantom{X_L(i,j)} = \max[X_L(i,j-1),\ V(i,j-1) - W_g] - W_s$
$X_R(i,j) = \max(?,\ X_R(i-1,j) - W_s,\ X_S(i-1,j) - W_g - W_s,\ V(i-1,j) - W_g - W_s)$
$\phantom{X_R(i,j)} = \max[X_R(i-1,j),\ V(i-1,j) - W_g] - W_s$
$V(i,j) = \max(X_L(i,j),\ X_R(i,j),\ X_S(i,j))$
Each $V(i,j)$ can be computed in $O(1)$ time. The optimal alignment with affine gap weights can be computed with a DP table of space and time complexity $O(nm)$.
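A possible Python rendering of the simplified affine recurrences is sketched below. The name `align_affine` and its parameters are hypothetical. One assumption differs from the base case listed for the affine model: in this sketch the boundary values $X_L(i,0)$ and $X_R(0,j)$ are treated as undefined ($-\infty$), so that a gap along the border is never extended as if it were a gap in the other string; the boundary values of $V$ are as given.

```python
NEG_INF = float("-inf")   # stands in for "?" (undefined)


def align_affine(s1, s2, wg, ws, match=2, mismatch=-2):
    """Global alignment value with affine gap penalty wg + q*ws, O(1) per cell."""
    n, m = len(s1), len(s2)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    XL = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # ends with s2[j] opposite a space
    XR = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # ends with s1[i] opposite a space
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = -wg - i * ws
    for j in range(1, m + 1):
        V[0][j] = -wg - j * ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            x_s = V[i - 1][j - 1] + d
            # Either extend the current gap (cost ws) or open a new one (wg + ws).
            XL[i][j] = max(XL[i][j - 1], V[i][j - 1] - wg) - ws
            XR[i][j] = max(XR[i - 1][j], V[i - 1][j] - wg) - ws
            V[i][j] = max(XL[i][j], XR[i][j], x_s)
    return V[n][m]


print(align_affine("ab", "a", wg=2, ws=1))       # -1: one match, one gap of length 1
print(align_affine("acgt", "acgt", wg=2, ws=1))  # 8: four matches, no gaps
```

Only three values per cell are kept, which is why each entry takes constant time instead of the $O(i+j)$ scan needed for an arbitrary gap function.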

29 Parallelization
Systolic Arrays:
– Create a special-purpose processor $P(i,j)$ for the $(i,j)$th entry of the dynamic programming table.
– Connect $P(i,j)$ to $P(i-1,j)$, $P(i-1,j-1)$ and $P(i,j-1)$.
– Each processor holds static data $W_g$ and $W_s$.
– Each processor stores and transmits dynamic data: $X_S(i,j)$, $X_L(i,j)$, $X_R(i,j)$ and $V(i,j)$.

30 Systolic Computation
Dynamically compute in one cycle:
– $X_S(i,j)$, $X_L(i,j)$, $X_R(i,j)$, $V(i,j)$
using
– $X_S(i-1,j)$, $X_L(i-1,j)$, $X_R(i-1,j)$, $V(i-1,j)$
– $X_S(i,j-1)$, $X_L(i,j-1)$, $X_R(i,j-1)$, $V(i,j-1)$
– $X_S(i-1,j-1)$, $X_L(i-1,j-1)$, $X_R(i-1,j-1)$, $V(i-1,j-1)$
and
– $W_g$ and $W_s$.
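The dependency pattern behind the systolic array can be imitated in software: every cell on the anti-diagonal $i + j = t$ depends only on cells from anti-diagonals $t-1$ and $t-2$, so all of them could be evaluated in the same cycle. The Python sketch below is a sequential simulation of that schedule, not a hardware description; the name `wavefront_local_score` is invented, and for brevity it uses the simple local-alignment recurrence with a linear space penalty rather than the affine variables.

```python
def wavefront_local_score(s1, s2, match=2, mismatch=-2, space=-1):
    """Local alignment value computed one anti-diagonal ("cycle") at a time."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    cycles = 0
    for t in range(2, n + m + 1):              # anti-diagonal i + j = t
        lo, hi = max(1, t - m), min(n, t - 1)  # valid rows on this diagonal
        # The cells below are mutually independent: each reads only
        # v[i-1][j], v[i][j-1] (diagonal t-1) and v[i-1][j-1] (diagonal t-2).
        for i in range(lo, hi + 1):
            j = t - i
            d = match if s1[i - 1] == s2[j - 1] else mismatch
            v[i][j] = max(0,
                          v[i - 1][j] + space,
                          v[i][j - 1] + space,
                          v[i - 1][j - 1] + d)
            best = max(best, v[i][j])
        cycles += 1
    return best, cycles


print(wavefront_local_score("pqraxabcstvq", "xyaxbacsll"))   # (8, 21)
```

With one processor per cell, the number of cycles in this schedule is $n + m - 1$ (21 for the example strings of lengths 12 and 10), compared with $nm$ sequential cell updates.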

31 Database Search
BLAST and its relatives: a query search ⇒
– Compare the query sequence to all the sequences in the database for local similarities.
Heuristics:
– BLAST
– FAST
Needs good complexity analysis.

32 BLAST
Basic Local Alignment Search Tool
Query sequence $q \in \Sigma^*$; database $L \subseteq \Sigma^*$.
BLAST returns a list of high-scoring segment pairs between the query sequence and sequences in the database.
The score function depends on PAM score functions.

33 BLAST Heuristics
BLAST is a 3-step algorithm:
– Step 1. Compile a list of high-scoring strings (words), $\mathcal{W}$. $\mathcal{W}$ = all $w$-mers that score at least a threshold $T$ with some $w$-mer of the query.
– Step 2. Search for hits; each hit defines a seed. Construct a DFA to recognize $\mathcal{W}$. Scan the database, compiling the hits.
– Step 3. Extend the seeds. The seeds are extended in both directions until the score falls a certain distance below the best so far.
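The seed-and-extend idea in steps 2 and 3 can be illustrated with a much simplified Python sketch. It is not BLAST itself and makes several labeled simplifications: the word list contains only the exact $w$-mers of the query (no scored neighborhood), a dictionary lookup replaces the DFA, extension is ungapped, and the names `find_seeds`, `extend_seed` and the `drop` parameter (how far the running score may fall below the best so far) are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, w):
    """Return (i, j) pairs where query[i:i+w] == subject[j:j+w] (exact hits only)."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, w, match=2, mismatch=-2, drop=4):
    """Ungapped extension of a seed; returns (score, query_start, query_end)."""
    def score(a, b):
        return match if a == b else mismatch

    def extend(qi, sj, step):
        best = run = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            run += score(query[qi], subject[sj])
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run > drop:      # fell too far below the best so far
                break
            qi += step
            sj += step
        return best, best_len

    seed_score = sum(score(query[i + k], subject[j + k]) for k in range(w))
    right, right_len = extend(i + w, j + w, 1)
    left, left_len = extend(i - 1, j - 1, -1)
    return seed_score + left + right, i - left_len, i + w + right_len


query, subject = "TTACGTAC", "GGACGTACGG"
print(find_seeds(query, subject, 4))          # [(2, 2), (3, 3), (4, 4), (1, 5)]
print(extend_seed(query, subject, 2, 2, 4))   # (12, 2, 8): query[2:8] == "ACGTAC"
```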

34 FAST
$s$, $t$ = two sequences being compared, with $|s| = m$ and $|t| = n$.
– Step 1. Determine the $k$-tuples common to both sequences ($k$ = 1 or 2).
– Step 2. Compute the "offset" of each common $k$-tuple: if the common $k$-tuple starts at positions $s[i]$ and $t[j]$, then offset = $i - j$.
– Step 3. Determine the most common offset value to align the sequences.
– Step 4. Combine the common $k$-tuples to create a region.

35 Example
s = H A R F Y A A Q I V L (positions 1 to 11)
t = V D M A A Q I A (positions 1 to 8)
Positions of 1-tuples in s:
– A → (2, 6, 7)
– F → (4)
– H → (1)
– I → (9)
– L → (11)
– Q → (8)
– R → (3)
– V → (10)
– Y → (5)
Offsets $i - j$ for each position $j$ of t:
– V (j = 1): {9}
– A (j = 4): {-2, 2, 3}
– A (j = 5): {-3, 1, 2}
– Q (j = 6): {2}
– I (j = 7): {2}
– A (j = 8): {-6, -2, -1}
(Figure: histogram of offset counts over the range -9 … 9.)
Alignment at the most common offset, 2:
 s = H A R F Y A A Q I V L
         + + + | | | | +
 t =     V D M A A Q I A
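Steps 1 to 3 of the FAST heuristic, applied to this example, can be reproduced with a short Python sketch. The function names `offset_histogram` and `best_offset` are invented for illustration; Python strings are 0-based, but the offset $i - j$ is unaffected by the choice of base.

```python
from collections import Counter, defaultdict


def offset_histogram(s, t, k=1):
    """Count offsets i - j over all k-tuples common to s and t."""
    positions = defaultdict(list)            # k-tuple -> start positions in s
    for i in range(len(s) - k + 1):
        positions[s[i:i + k]].append(i)
    hist = Counter()
    for j in range(len(t) - k + 1):
        for i in positions.get(t[j:j + k], ()):
            hist[i - j] += 1
    return hist


def best_offset(s, t, k=1):
    """Return (offset, count) for the most common offset, or None if no k-tuple is shared."""
    hist = offset_histogram(s, t, k)
    if not hist:
        return None
    return max(hist.items(), key=lambda item: item[1])


hist = offset_histogram("HARFYAAQIVL", "VDMAAQIA")
print(sorted(hist.items()))
print(best_offset("HARFYAAQIVL", "VDMAAQIA"))   # (2, 4): offset 2 occurs four times
```

Offset 2 is supported by the four matching columns A, A, Q, I, which is the diagonal along which step 4 would then build a region.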

