Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky.

2 Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky

3 Outline: alignment recap; end-space-free alignment; affine gap alignment algorithm and proof; bounded gap/space alignments.

4 Dynamic programming. Useful in many string-related settings; it will be used repeatedly in the course. General idea: confine the exponential number of possibilities into some “hierarchy” of subproblems, such that the number of cases to evaluate becomes polynomial.

5 Dynamic programming for shortest paths. Finding the shortest path from X to Y using the Floyd-Warshall algorithm. Idea: if we know the shortest paths that use only intermediate vertices {1,…,k-1}, computing the shortest paths that use {1,…,k} is easy:
$$d_{ij}^{(k)} = \begin{cases} w_{ij} & \text{if } k = 0 \\ \min\{d_{ij}^{(k-1)},\ d_{ik}^{(k-1)} + d_{kj}^{(k-1)}\} & \text{otherwise} \end{cases}$$
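The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `floyd_warshall` and the adjacency-matrix input format (with `math.inf` for missing edges) are assumptions made for the example.

```python
import math

def floyd_warshall(w):
    """Return all-pairs shortest path lengths.

    w[i][j] is the weight of edge i -> j (math.inf if there is no edge,
    0 on the diagonal). Hypothetical input format chosen for this sketch.
    """
    n = len(w)
    d = [row[:] for row in w]      # k = 0 case: d_ij = w_ij
    for k in range(n):             # now also allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                via_k = d[i][k] + d[k][j]
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d
```

Updating `d` in place is safe here because row `k` and column `k` do not change during iteration `k`, so only one matrix is needed instead of one per value of k.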

6 Alignment reminder. [Figure: the three ways to form the last column when aligning Something1·G with Something2·C: G aligned with C, G aligned with a space, or C aligned with a space.]

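7 Global alignment. Input: $S_1$, $S_2$. Output: a maximum-score alignment. $V(k,l)$ is the score of aligning $S_1[1..k]$ with $S_2[1..l]$; $s_i$ and $t_j$ denote the $i$-th character of $S_1$ and the $j$-th character of $S_2$, with $n = |S_1|$ and $m = |S_2|$.
Base conditions:
$$V(i,0) = \sum_{k=1}^{i} \sigma(s_k, -), \qquad V(0,j) = \sum_{k=1}^{j} \sigma(-, t_k)$$
Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i, t_j) \\ V(i-1,j) + \sigma(s_i, -) \\ V(i,j-1) + \sigma(-, t_j) \end{cases}$$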
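A minimal Python sketch of this recurrence follows. The function names and the particular scoring values in `simple_sigma` (match 1, mismatch -1, space -2) are assumptions for illustration only; the slides leave the scoring function unspecified.

```python
def simple_sigma(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical scoring function; '-' denotes a space."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def global_alignment_score(s, t, sigma=simple_sigma):
    """Fill the table V and return V(n, m), the optimal global score."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                 # base: prefix of s against spaces
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, m + 1):                 # base: prefix of t against spaces
        V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),  # s_i with t_j
                V[i - 1][j] + sigma(s[i - 1], "-"),           # s_i with a space
                V[i][j - 1] + sigma("-", t[j - 1]),           # t_j with a space
            )
    return V[n][m]
```

For example, `global_alignment_score("ACGT", "AGT")` aligns three matching characters and pays for one space.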

8 Alignment reminder: global alignment. All of $S_1$ has to be aligned with all of $S_2$. Every gap is “paid for”. The solution equals $V(n,m)$: the alignment score is read from that cell, and the traceback goes all the way back to the start of the matrix.

9 Local alignment. A substring of $S_1$ is aligned with a substring of $S_2$. Gaps outside the substrings are “costless”. The solution equals the maximum-score cell in the DP matrix.
Base conditions:
$$V(i,0) = 0, \qquad V(0,j) = 0$$
Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i, t_j) \\ V(i-1,j) + \sigma(s_i, -) \\ V(i,j-1) + \sigma(-, t_j) \\ 0 \end{cases}$$
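The same table-filling idea, with the extra 0 option and the answer taken from the best cell, can be sketched as follows. The function name and the default scores (match 1, mismatch -1, space -2) are assumptions for the example.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Return the maximum cell of the local-alignment table (scores are assumed)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # base conditions: first row/column are 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + sub,   # s_i with t_j
                V[i - 1][j] + gap,       # s_i with a space
                V[i][j - 1] + gap,       # t_j with a space
                0,                       # start a fresh alignment here
            )
            best = max(best, V[i][j])
    return best
```

The 0 option is what makes prefixes before the aligned substrings free; taking the maximum over all cells makes the suffixes after them free.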

10 Ends-free alignment. Something between global and local; consider aligning a gene to a (bacterial) genome. Gaps at the beginning and end of S and T are costless, but all of S and T should be aligned.
Base conditions:
$$V(i,0) = 0, \qquad V(0,j) = 0$$
Recurrence relation, for $1 \le i \le n$, $1 \le j \le m$:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i, t_j) \\ V(i-1,j) + \sigma(s_i, -) \\ V(i,j-1) + \sigma(-, t_j) \end{cases}$$
The optimal solution is found in the last row or the last column (not necessarily at the bottom-right corner).
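A short sketch of the ends-free variant, assuming the same illustrative scores as before (match 1, mismatch -1, space -2; these values and the function name are not from the slides). Compared with global alignment, only the base conditions and the place where the answer is read change.

```python
def ends_free_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best score with free leading and trailing gaps (scores are assumed)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # zero first row/column: free leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + sub,
                V[i - 1][j] + gap,
                V[i][j - 1] + gap,
            )
    last_row = max(V[n])                          # all of s consumed, rest of t is free
    last_col = max(V[i][m] for i in range(n + 1)) # all of t consumed, rest of s is free
    return max(last_row, last_col)
```

With a short "gene" and a longer "genome", `ends_free_alignment_score("ACGT", "TTTACGTTTT")` finds the embedded copy without paying for the flanking characters.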

11 Handling weird gaps. Affine gap: different costs for a “new” gap and an “old” gap. [Figure: the three last-column cases from the alignment reminder, with the preceding column annotated “Now we care if there were gaps here”.] Two new things to keep track of, hence two additional matrices.

12 Alignment with affine gap penalty.
Base conditions:
$$V(i,0) = F(i,0) = W_g + i W_s, \qquad V(0,j) = E(0,j) = W_g + j W_s$$
Recursive computation:
$$V(i,j) = \max\{E(i,j),\ F(i,j),\ G(i,j)\}$$
where
$$G(i,j) = V(i-1,j-1) + \sigma(s_i, t_j)$$
$$E(i,j) = \max\{E(i,j-1) + W_s,\ G(i,j-1) + W_g + W_s,\ F(i,j-1) + W_g + W_s\}$$
$$F(i,j) = \max\{F(i-1,j) + W_s,\ G(i-1,j) + W_g + W_s,\ E(i-1,j) + W_g + W_s\}$$
[Figure: the three alignment shapes. G(i,j): $s_i$ is aligned with $t_j$. E(i,j): the alignment ends with $t_j$ opposite a space in S. F(i,j): the alignment ends with $s_i$ opposite a space in T.]
Time complexity: $O(nm)$, computing 4 matrices instead of one. Space complexity: $O(nm)$, saving 3 (why?) matrices; $O(n+m)$ with Hirschberg's algorithm.

13 When do constant and affine gap costs differ? Consider aligning AGAGACTGACGCTTA with ATATTA.
Alignment 1:
AGAGACTGACGCTTA
----A-T-A---TTA
Alignment 2:
AGAGACTGACGCTTA
ATA---------TTA
Constant penalty (mismatch: -5, gap: -1): alignment 1 scores -9, alignment 2 scores -14.
Affine penalty (mismatch: -5, gap open: -3, gap extend: -0.5): alignment 1 scores -14.5, alignment 2 scores -12.
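The four scores on this slide can be checked by scoring the two alignments directly. The sketch below rests on two assumptions that are not stated on the slide but that reproduce its numbers: a match scores 0, and the gap-open penalty already covers the first space of a gap, with each further space paying the extend penalty. The function names are hypothetical.

```python
def score_constant(a, b, mismatch=-5, gap=-1):
    """Score two equal-length alignment rows with a per-space penalty."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap
        elif x != y:
            total += mismatch          # matches are assumed to score 0
    return total

def score_affine(a, b, mismatch=-5, gap_open=-3.0, gap_extend=-0.5):
    """Score two alignment rows; gap_open is assumed to pay for the first space."""
    total = 0.0
    prev_gap_row = None                # row holding the previous column's space, if any
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            total += gap_extend if row == prev_gap_row else gap_open
            prev_gap_row = row
        else:
            if x != y:
                total += mismatch
            prev_gap_row = None
    return total

S = "AGAGACTGACGCTTA"
ALIGNMENT_1 = "----A-T-A---TTA"   # four short gaps, no mismatches
ALIGNMENT_2 = "ATA---------TTA"   # one long gap, one mismatch
```

Under the constant penalty the many-gap alignment wins because only the number of spaces matters; under the affine penalty the single long gap wins because each additional gap opening is expensive.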

14 Bounding the number of gaps. Let's say we are allowed to have at most K gaps (gaps ≠ spaces: a gap can contain many spaces). Now we keep track of the number of gaps we have opened so far. We also still need to keep track of whether a gap is currently open in S or T (the E/F matrices).

15 Bounding the number of gaps. A “multi-layer” DP matrix: actually separate functions V, E, F on every layer, keeping track of the layer number. Every time we open or close a gap we “jump” to the next layer. Where to look for the solution? (Not only at the last layer!) What is the complexity?

16 Bounding the number of spaces. Let's say that no gap can exceed k spaces. Of course, now we cannot also bound the number of gaps (why?). How many matrices do we need now? Here there is no monotone notion of a layer like before. What's the complexity?

17 What about arbitrary gap functions? Suppose the gap cost is an arbitrary function f(k) of the gap length k. Then, when computing $D_{ij}$, we need to look at j places “back” and i places “up”. Complexity? [Figure: $D_{ij}$ computed as a min over those cells.]
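A direct way to handle an arbitrary f(k) is to try every gap length ending at the current cell. The sketch below uses the cost (min) formulation suggested by the slide; the function name, the `sub_cost` callback, and treating two adjacent gaps as separately charged are assumptions of this example. Each cell scans its whole row and column, so the running time is O(nm(n+m)).

```python
import math

def arbitrary_gap_cost(s, t, sub_cost, f):
    """Minimum alignment cost where a gap of k spaces costs f(k)."""
    n, m = len(s), len(t)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = math.inf
            if i > 0 and j > 0:                    # s_i aligned with t_j
                best = D[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1])
            for k in range(1, j + 1):              # look k places "back": gap in s
                best = min(best, D[i][j - k] + f(k))
            for k in range(1, i + 1):              # look k places "up": gap in t
                best = min(best, D[i - k][j] + f(k))
            D[i][j] = best
    return D[n][m]
```

With `f = lambda k: k` and a 0/1 substitution cost this reduces to ordinary edit distance, which gives a convenient sanity check.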

18 Special cases. How about a logarithmic penalty, $W_g + W_s \log(k)$? This is a special case of a convex penalty, which is solvable in $O(mn \log m)$. The logarithmic case can be done in $O(mn)$. For a piecewise-linear gap function made of K lines, the DP can be done in $O(mn \log K)$.

19 Supersequence. Exercise: A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A; e.g., YABADABADU is a non-contiguous supersequence of BABU. Given S and T, find their shortest common supersequence.

20 Reminder: LCS. Longest common non-contiguous subsequence: adjust global alignment to use similarity scores of 1 for a match, 0 for gaps, and -∞ for mismatches.

21 Supersequence. Find the longest common subsequence of S and T, then generate the string as follows, for every column in the alignment: on a match, add the matching character (once!); on a gap, add the character aligned against the gap.

22 Supersequence. For S=“Pride”, T=“Parade”:
P-R-IDE
PARA-DE
PARAIDE: shortest common supersequence
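The construction from the previous slides can be sketched as an LCS table followed by a traceback that emits one character per alignment column. The function names are hypothetical, and ties in the traceback are broken arbitrarily, so the returned string may differ from the one on the slide while having the same length.

```python
def shortest_common_supersequence(s, t):
    """Build a shortest common supersequence of s and t from their LCS table."""
    n, m = len(s), len(t)
    # L[i][j] = length of the longest common subsequence of s[:i] and t[:j]
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    out = []                       # collected back to front
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:   # match column: add the character once
            out.append(s[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            out.append(s[i - 1]); i -= 1   # s_i aligned against a gap
        else:
            out.append(t[j - 1]); j -= 1   # t_j aligned against a gap
    out.extend(reversed(s[:i]))    # leftover prefix of s (if any)
    out.extend(reversed(t[:j]))    # leftover prefix of t (if any)
    return "".join(reversed(out))

def is_subsequence(b, a):
    """True if b is a (non-contiguous) subsequence of a."""
    it = iter(a)
    return all(ch in it for ch in b)
```

The length of the result is always len(s) + len(t) minus the LCS length, which for PRIDE and PARADE is 5 + 6 - 4 = 7.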

23 Exercise: finding repeats. Basic objective: find a pair of subsequences within S with maximum similarity. Simple (albeit wrong) idea: find an optimal alignment of S with itself! (Why wrong?) But using local alignment is still a good idea.

24 Variant #1. Specific requirement: the two sequences may overlap. Solution: change the local alignment algorithm to compute only the upper triangular submatrix (V(i,j), where j>i), and set the diagonal values to 0. Complexity: $O(n^2)$ time and $O(n)$ space.
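A sketch of this variant, keeping only two rows of the table to stay within O(n) space. The function name and the scores (match 1, mismatch -1, space -2) are assumptions for illustration; the function returns only the best score, not the positions of the repeat.

```python
def best_repeat_score(s, match=1, mismatch=-1, gap=-2):
    """Local alignment of s with itself, restricted to cells with j > i."""
    n = len(s)
    best = 0
    prev = [0] * (n + 1)               # row i-1 of the table
    for i in range(1, n + 1):
        cur = [0] * (n + 1)            # cells with j <= i stay 0 (diagonal and below)
        for j in range(i + 1, n + 1):  # strictly above the diagonal
            sub = match if s[i - 1] == s[j - 1] else mismatch
            cur[j] = max(0,
                         prev[j - 1] + sub,
                         prev[j] + gap,
                         cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

Skipping the diagonal is what rules out the trivial answer of aligning S with itself position by position; the two copies found may still overlap, as in `best_repeat_score("AAAA")`.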

25 Variant #2. Specific requirement: the two sequences may not overlap. Solution: absence of overlap means that some k exists such that one string lies in S[1..k] and the other in S[k+1..n]. Check local alignments between S[1..k] and S[k+1..n] for every 1<=k<n, and pick the highest-scoring alignment. Complexity: $O(n^3)$ time and $O(n)$ space.

26 Variant #2

27 Variant #3. Specific requirement: the two sequences must be consecutive (tandem repeat). Solution: similar to variant #2, but somewhat “ends-free”: seek a global alignment between S[1..k] and S[k+1..n], with no penalties for gaps at the beginning of S[1..k] and no penalties for gaps at the end of S[k+1..n]. Complexity: $O(n^3)$ time and $O(n)$ space.

28 Variant #3

29 Variant #4. Specific requirement: the two sequences must be consecutive, and the similarity is measured between the first sequence and the reverse complement of the second, $S^{RC}$ (inverted repeat). It is tempting (albeit wrong) to use something in the spirit of variant #3, which would give complexity $O(n^3)$.

30 Variant #4. Solution: compute the local alignment between S and $S^{RC}$, and look for results on the diagonal i+j=n. Example: AGCTAACGCGTTCGAA (n=16). [Figure: index 8 marked in S and index 8 marked in $S^{RC}$.] Complexity: $O(n^2)$ time, $O(n)$ space.

