Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky.


Outline: alignment recap; end-space free alignment; affine gap alignment algorithm and proof; bounded gap/spaces alignments

Dynamic programming
Useful in many string-related settings; will be used repeatedly in the course.
General idea: confine the exponential number of possibilities into some “hierarchy”, such that the number of cases becomes polynomial.

Dynamic programming for shortest paths
Finding the shortest path from X to Y using the Floyd-Warshall algorithm.
Idea: if we know the shortest paths that use only intermediate vertices {1,…,k-1}, computing the shortest paths that use {1,…,k} is easy:
$$d_{ij}^{(k)} = \begin{cases} w_{ij} & \text{if } k = 0 \\ \min\{\, d_{ij}^{(k-1)},\; d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \,\} & \text{otherwise} \end{cases}$$
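The recurrence can be sketched in a few lines of Python. This is a minimal illustration, not code from the recitation: the function name `floyd_warshall`, the 0-based vertex numbering and the small example graph are assumptions made here.

```python
import math

def floyd_warshall(w):
    """All-pairs shortest path lengths.

    w[i][j] is the weight of edge i -> j (math.inf if there is no edge).
    Vertices are numbered 0..n-1 here (an assumption of this sketch).
    """
    n = len(w)
    d = [row[:] for row in w]          # k = 0: no intermediate vertices
    for i in range(n):
        d[i][i] = min(d[i][i], 0)      # a vertex reaches itself at cost 0
    for k in range(n):                 # additionally allow k as intermediate
        for i in range(n):
            for j in range(n):
                through_k = d[i][k] + d[k][j]
                if through_k < d[i][j]:
                    d[i][j] = through_k
    return d

INF = math.inf
example_graph = [[0, 3, INF],
                 [INF, 0, 1],
                 [7, INF, 0]]
dist = floyd_warshall(example_graph)
print(dist[0][2])  # length of the path 0 -> 1 -> 2
```

Updating `d` in place is safe because row k and column k do not change during iteration k.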

Alignment reminder
[Figure: the three possible last columns of an alignment of two strings ending in G and C: G opposite C, G opposite a space, or a space opposite C.]

Global alignment
Input: S_1, S_2 (with s_i = S_1[i], t_j = S_2[j], n = |S_1|, m = |S_2|). Output: an optimal (maximum-score) alignment.
V(k,l): score of aligning S_1[1..k] with S_2[1..l].
Base conditions:
$$V(i,0) = \sum_{k=1}^{i} \sigma(s_k, -), \qquad V(0,j) = \sum_{k=1}^{j} \sigma(-, t_k)$$
Recurrence relation, for 1 ≤ i ≤ n, 1 ≤ j ≤ m:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i, t_j) \\ V(i-1,j) + \sigma(s_i, -) \\ V(i,j-1) + \sigma(-, t_j) \end{cases}$$
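A minimal sketch of the global alignment table in Python. It assumes, for brevity, that every space costs the same constant `gap` (that is, σ(x,-) = σ(-,x) = gap); the names `global_alignment_score` and `match_mismatch` and the scoring values are illustrative choices, not part of the slides.

```python
def match_mismatch(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def global_alignment_score(s, t, sigma, gap):
    """Fill V(i,j) for global alignment and return V(n,m)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # base condition: prefix of s vs. spaces
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, m + 1):          # base condition: spaces vs. prefix of t
        V[0][j] = V[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                          V[i - 1][j] + gap,     # s_i opposite a space
                          V[i][j - 1] + gap)     # space opposite t_j
    return V[n][m]

print(global_alignment_score("ACGT", "AGT", match_mismatch, -2))
```

Keeping back-pointers (or re-deriving them from the table) would recover the alignment itself; the sketch returns only the score.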

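Alignment reminder: global alignment
All of S_1 has to be aligned with all of S_2, and every gap is “paid for”.
The solution equals V(n,m): the alignment score is read from the bottom-right cell of the DP matrix, and the traceback runs all the way back to the top-left corner.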

Local alignment
A substring of S_1 is aligned with a substring of S_2; gaps outside these substrings are “costless”.
The solution equals the maximum-score cell in the DP matrix.
Base conditions:
$$V(i,0) = 0, \qquad V(0,j) = 0$$
Recurrence relation, for 1 ≤ i ≤ n, 1 ≤ j ≤ m:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i, t_j) \\ V(i-1,j) + \sigma(s_i, -) \\ V(i,j-1) + \sigma(-, t_j) \\ 0 \end{cases}$$
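The local variant differs from the global one only in the zero base conditions, the extra 0 in the maximum, and where the answer is read. The sketch below keeps two rows at a time, which suffices when only the score is needed. The constant per-space cost `gap` and the names `local_alignment_score` and `sigma_2_1` are assumptions of this example.

```python
def sigma_2_1(a, b):
    """Hypothetical scoring function: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

def local_alignment_score(s, t, sigma, gap):
    """Best local alignment score: the maximum cell of the DP matrix."""
    n, m = len(s), len(t)
    prev = [0] * (m + 1)               # row i-1; V(0,j) = 0
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (m + 1)            # V(i,0) = 0
        for j in range(1, m + 1):
            cur[j] = max(0,
                         prev[j - 1] + sigma(s[i - 1], t[j - 1]),
                         prev[j] + gap,
                         cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

print(local_alignment_score("XXACGTXX", "YYACGTYY", sigma_2_1, -2))
```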

Ends-free alignment
Something between global and local; consider aligning a gene to a (bacterial) genome.
Gaps at the beginning and end of S and T are costless, but all of S and T should be aligned.
Base conditions:
$$V(i,0) = 0, \qquad V(0,j) = 0$$
Recurrence relation, for 1 ≤ i ≤ n, 1 ≤ j ≤ m:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \sigma(s_i, t_j) \\ V(i-1,j) + \sigma(s_i, -) \\ V(i,j-1) + \sigma(-, t_j) \end{cases}$$
The optimal solution is found in the last row or column (not necessarily at the bottom-right corner).
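As a rough sketch, the ends-free score can be computed with the global recurrence, zero base conditions, and a maximum over the last row and last column. The function names, the constant per-space cost `gap` and the toy "gene in genome" strings are assumptions made for illustration.

```python
def unit_score(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def ends_free_score(s, t, sigma, gap):
    """Ends-free alignment score: leading and trailing gaps are free."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # V(i,0) = V(0,j) = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
    last_row = max(V[n])                        # all of s consumed
    last_col = max(row[m] for row in V)         # all of t consumed
    return max(last_row, last_col)

# A short "gene" embedded in a longer "genome".
print(ends_free_score("ACGT", "TTTTACGTTTTT", unit_score, -2))
```

With the same scores a global alignment of these two strings would pay for eight spaces, which is the situation the ends-free variant is meant to avoid.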

Handling weird gaps
Affine gap: different costs for “new” and “old” gaps.
[Figure: the three possible last columns of an alignment, as in the alignment reminder; now we care whether the preceding column contained a gap.]
Two new things to keep track of, hence two additional matrices.

Alignment with Affine Gap Penalty
Base conditions:
$$V(i,0) = F(i,0) = W_g + iW_s, \qquad V(0,j) = E(0,j) = W_g + jW_s$$
Recursive computation:
$$V(i,j) = \max\{E(i,j),\, F(i,j),\, G(i,j)\}$$
where
$$G(i,j) = V(i-1,j-1) + \sigma(s_i, t_j)$$
$$E(i,j) = \max\{E(i,j-1) + W_s,\; G(i,j-1) + W_g + W_s,\; F(i,j-1) + W_g + W_s\}$$
$$F(i,j) = \max\{F(i-1,j) + W_s,\; G(i-1,j) + W_g + W_s,\; E(i-1,j) + W_g + W_s\}$$
G(i,j) covers alignments that end with s_i opposite t_j, E(i,j) those that end with a space in S opposite t_j, and F(i,j) those that end with s_i opposite a space in T.
Time complexity: O(nm), computing 4 matrices instead of one.
Space complexity: O(nm), saving 3 (why?) matrices; O(n+m) with Hirschberg's technique.
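A sketch of the three-matrix computation in Python, assuming W_g and W_s are given as negative numbers that are added to the score, and using minus infinity for cells that no valid alignment can reach. The names `affine_gap_score` and `plus_minus_one` and the example parameters are illustrative assumptions.

```python
import math

def plus_minus_one(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def affine_gap_score(s, t, sigma, Wg, Ws):
    """Global alignment score with gap cost Wg + k*Ws for a gap of k spaces."""
    n, m = len(s), len(t)
    NEG = -math.inf
    G = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with s_i opposite t_j
    E = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with space opposite t_j
    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # ends with s_i opposite space
    G[0][0] = 0                                  # empty alignment
    for i in range(1, n + 1):
        F[i][0] = Wg + i * Ws
    for j in range(1, m + 1):
        E[0][j] = Wg + j * Ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v_diag = max(G[i - 1][j - 1], E[i - 1][j - 1], F[i - 1][j - 1])
            G[i][j] = v_diag + sigma(s[i - 1], t[j - 1])
            E[i][j] = max(E[i][j - 1] + Ws,            # extend the gap
                          G[i][j - 1] + Wg + Ws,       # open a new gap
                          F[i][j - 1] + Wg + Ws)
            F[i][j] = max(F[i - 1][j] + Ws,
                          G[i - 1][j] + Wg + Ws,
                          E[i - 1][j] + Wg + Ws)
    return max(G[n][m], E[n][m], F[n][m])        # V(n,m)

# One gap of three spaces: 6 matches + Wg + 3*Ws.
print(affine_gap_score("AAATTTCCC", "AAACCC", plus_minus_one, -3, -1))
```

V is not stored separately here because it is always the maximum of the other three matrices at the same cell.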

When do constant and affine gap costs differ?
Consider: AGAGACTGACGCTTA and ATATTA
Constant penalty (mismatch: -5, gap: -1):
AGAGACTGACGCTTA
----A-T-A---TTA
Affine penalty (mismatch: -5, gap open: -3):
AGAGACTGACGCTTA
ATA TTA
Gap extend:

Bounding the number of gaps
Let's say we are allowed to have at most K gaps (gaps ≠ spaces: a gap can contain many spaces).
Now we keep track of the number of gaps we have opened so far.
We also still need to keep track of whether a gap is currently open in S or T (E/F matrices).

Bounding the number of gaps A “multi-layer” DP matrix Actually separate functions – V,E,F, on every layer, keeping track of layer no. Every time we open or close a gap we “jump” to the next layer Where to look for the solution? (not only at last layer!) What is the complexity?

Bounding the number of spaces
Let’s say that no gap can exceed k spaces.
Of course, we now cannot also bound the number of gaps (why?).
How many matrices do we need now? Here there is no monotone notion of layer like before.
What’s the complexity?

What about arbitrary gap functions?
Suppose the gap cost is an arbitrary function f(k) of the gap length k.
Then, when computing D_{ij}, we need to look at j places “back” and i places “up” and take the minimum over all of them.
Complexity?
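One way to sketch this in Python, in the cost-minimizing form that the slide's D and min suggest. Each cell scans its whole row and column, so the running time is O(nm(n+m)). The function names, the substitution cost and the two example gap functions are assumptions of this sketch.

```python
def unit_mismatch(a, b):
    """Hypothetical substitution cost: 0 for a match, 1 for a mismatch."""
    return 0 if a == b else 1

def general_gap_cost(s, t, sub_cost, f):
    """Minimum alignment cost when a gap of k spaces costs f(k)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = f(i)                 # one gap covering s[1..i]
    for j in range(1, m + 1):
        D[0][j] = f(j)                 # one gap covering t[1..j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = D[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1])
            for k in range(1, j + 1):  # gap of length k ending at t_j ("back")
                best = min(best, D[i][j - k] + f(k))
            for k in range(1, i + 1):  # gap of length k ending at s_i ("up")
                best = min(best, D[i - k][j] + f(k))
            D[i][j] = best
    return D[n][m]

# A gap costs 3 regardless of its length.
print(general_gap_cost("AAATTTCCC", "AAACCC", unit_mismatch, lambda k: 3))
```

With f(k) = k and unit mismatch cost this reduces to ordinary edit distance, which gives a convenient sanity check.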

Special cases
How about a logarithmic penalty, $W_g + W_s \log(k)$?
This is a special case of a convex penalty, which is solvable in $O(mn \log m)$.
The logarithmic case can be done in $O(mn)$.
For a piecewise-linear gap function made of K lines, DP can be done in $O(mn \log K)$.

Supersequence
Exercise: A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A.
E.g., YABADABADU is a non-contiguous supersequence of BABU.
Given S and T, find their shortest common supersequence.

Reminder: LCS
Longest common non-contiguous subsequence: adjust global alignment to use similarity scores of 1 for a match, 0 for gaps, and -∞ for mismatches.

Supersequence Find the longest common sub-sequence of S,T Generate the string as follows: for every column in the alignment Match – add the matching character (once!) Gap – add the character aligned against the gap

Supersequence
For S=“Pride”, T=“Parade”:
P-R-IDE
PARA-DE
PARAIDE is a shortest common supersequence.
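The construction can be sketched by filling the LCS table and walking back through it, emitting each matched character once and every unmatched character as it is passed. The function names are assumptions of this sketch, and ties are broken arbitrarily, so the string returned may be a different shortest supersequence from the one on the slide.

```python
def shortest_common_supersequence(s, t):
    """Build a shortest common supersequence of s and t from the LCS table."""
    n, m = len(s), len(t)
    # L[i][j] = length of the LCS of s[:i] and t[:j]
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    out = []                           # built back to front
    i, j = n, m
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:       # match column: emit the character once
            out.append(s[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            out.append(s[i - 1]); i -= 1   # s_i opposite a gap
        else:
            out.append(t[j - 1]); j -= 1   # t_j opposite a gap
    out.extend(reversed(s[:i]))        # leftover prefix of s, if any
    out.extend(reversed(t[:j]))        # leftover prefix of t, if any
    return "".join(reversed(out))

def is_subsequence(b, a):
    """True if b is a (non-contiguous) subsequence of a."""
    it = iter(a)
    return all(ch in it for ch in b)

scs = shortest_common_supersequence("PRIDE", "PARADE")
print(scs, len(scs))
```

The length of the result is always |S| + |T| minus the LCS length.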

Exercise: Finding repeats Basic objective: find a pair of subsequences within S with maximum similarity Simple (albeit wrong) idea: Find an optimal alignment of S with itself! (Why wrong?) But using local alignment is still a good idea

Variant #1
Specific requirement: the two sequences may overlap.
Solution: change the local alignment algorithm: compute only the upper triangular submatrix (V(i,j), where j>i), and set the diagonal values to 0.
Complexity: $O(n^2)$ time and $O(n)$ space.
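A sketch of this variant in Python, keeping two rows so that the space stays linear. The name `best_repeat_score`, the constant per-space cost `gap` and the scoring values are assumptions made for illustration.

```python
def repeat_sigma(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def best_repeat_score(s, sigma, gap):
    """Best local alignment of s against itself, restricted to cells j > i."""
    n = len(s)
    prev = [0] * (n + 1)
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (n + 1)            # cells with j <= i stay 0 (diagonal = 0)
        for j in range(i + 1, n + 1):  # strictly upper triangle only
            cur[j] = max(0,
                         prev[j - 1] + sigma(s[i - 1], s[j - 1]),
                         prev[j] + gap,
                         cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

print(best_repeat_score("ACGTACGT", repeat_sigma, -2))
```

Excluding the diagonal is what rules out the trivial alignment of S with itself; overlapping copies remain allowed, as in "AAAA", where AAA aligns with AAA shifted by one.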

Variant #2
Specific requirement: the two sequences may not overlap.
Solution: absence of overlap means that some k exists such that one string is in S[1..k] and the other in S[k+1..n]. Check local alignments between S[1..k] and S[k+1..n] for every 1 ≤ k < n, and pick the highest-scoring alignment.
Complexity: $O(n^3)$ time and $O(n)$ space.
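A sketch of the split-point search in Python: a linear-space local alignment is run once per split position k, giving cubic time overall. The function names, the constant per-space cost `gap` and the scoring values are assumptions of this example.

```python
def split_sigma(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def local_score(a, b, sigma, gap):
    """Best local alignment score between a and b, using two rows."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cur[j] = max(0,
                         prev[j - 1] + sigma(a[i - 1], b[j - 1]),
                         prev[j] + gap,
                         cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def best_nonoverlapping_repeat(s, sigma, gap):
    """Try every split S[1..k] | S[k+1..n] and keep the best local score."""
    return max((local_score(s[:k], s[k:], sigma, gap)
                for k in range(1, len(s))), default=0)

print(best_nonoverlapping_repeat("ACGTACGT", split_sigma, -2))
```

On "AAAA" this gives 2 (AA against AA), compared with 3 when overlap is allowed.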

Variant #3
Specific requirement: the two sequences must be consecutive (tandem repeat).
Solution: similar to variant #2, but somewhat “ends-free”: seek a global alignment between S[1..k] and S[k+1..n], with no penalties for gaps at the beginning of S[1..k] and no penalties for gaps at the end of S[k+1..n].
Complexity: $O(n^3)$ time and $O(n)$ space.

Variant #4
Specific requirement: the two sequences must be consecutive, and the similarity is measured between the first sequence and the reverse complement of the second, $S^{RC}$ (inverted repeat).
It is tempting (albeit wrong) to use something in the spirit of variant #3, which would give complexity $O(n^3)$.

Variant #4
Solution: compute the local alignment between S and $S^{RC}$, and look for results on the diagonal i+j=n.
Example: AGCTAACGCGTTCGAA (n=16). [Figure: index 8 marked in both S and $S^{RC}$.]
Complexity: $O(n^2)$ time, $O(n)$ space.