Sequence Alignment Tanya Berger-Wolf CS502: Algorithms in Computational Biology January 25, 2011.

Slides:



Advertisements
Similar presentations
1 Chapter 6 Dynamic Programming Slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved.
Advertisements

CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Global Sequence Alignment.
CSEP 521 Applied Algorithms Richard Anderson Lecture 7 Dynamic Programming.
CSCI 256 Data Structures and Algorithm Analysis Lecture 13 Some slides by Kevin Wayne copyright 2005, Pearson Addison Wesley all rights reserved, and some.
Dynamic Programming Lecture 9 Asst. Prof. Dr. İlker Kocabaş 1.
DYNAMIC PROGRAMMING. 2 Algorithmic Paradigms Greedy. Build up a solution incrementally, myopically optimizing some local criterion. Divide-and-conquer.
Sequence Alignment Algorithms in Computational Biology Spring 2006 Edited by Itai Sharon Most slides have been created and edited by Nir Friedman, Dan.
CPM '05 Sensitivity Analysis for Ungapped Markov Models of Evolution David Fernández-Baca Department of Computer Science Iowa State University (Joint work.
CSE 421 Algorithms Richard Anderson Lecture 19 Longest Common Subsequence.
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.
CSE 421 Algorithms Richard Anderson Lecture 19 Longest Common Subsequence.
1 prepared from lecture material © 2004 Goodrich & Tamassia COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969 WARNING This material.
1 Algorithmic Paradigms Greed. Build up a solution incrementally, myopically optimizing some local criterion. Divide-and-conquer. Break up a problem into.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Class 2: Basic Sequence Alignment
1 Theory I Algorithm Design and Analysis (11 - Edit distance and approximate string matching) Prof. Dr. Th. Ottmann.
Dynamic Programming Adapted from Introduction and Algorithms by Kleinberg and Tardos.
Sequence Analysis Determining how similar 2 (or more) gene/protein sequences are (too each other) is a “staple” function in bioinformatics. This information.
CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms.
Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky.
Pairwise Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University.
Dynamic Programming. Well known algorithm design techniques:. –Divide-and-conquer algorithms Another strategy for designing algorithms is dynamic programming.
9/10/10 A. Smith; based on slides by E. Demaine, C. Leiserson, S. Raskhodnikova, K. Wayne Adam Smith Algorithm Design and Analysis L ECTURE 13 Dynamic.
ADA: 7. Dynamic Prog.1 Objective o introduce DP, its two hallmarks, and two major programming techniques o look at two examples: the fibonacci.
1 Algorithmic Paradigms Greed. Build up a solution incrementally, myopically optimizing some local criterion. Divide-and-conquer. Break up a problem into.
Dynamic programming 叶德仕
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Final Exam Review Final exam will have the similar format and requirements as Mid-term exam: Closed book, no computer, no smartphone Calculator is Ok Final.
CSCI-256 Data Structures & Algorithm Analysis Lecture Note: Some slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved. 17.
Minimum Edit Distance Definition of Minimum Edit Distance.
1 CPSC 320: Intermediate Algorithm Design and Analysis July 28, 2014.
Prof. Swarat Chaudhuri COMP 482: Design and Analysis of Algorithms Spring 2012 Lecture 17.
Prof. Swarat Chaudhuri COMP 482: Design and Analysis of Algorithms Spring 2012 Lecture 16.
1 Chapter 6 Dynamic Programming. 2 Algorithmic Paradigms Greedy. Build up a solution incrementally, optimizing some local criterion. Divide-and-conquer.
1 Chapter 6 Dynamic Programming Slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved.
1 Chapter 6-1 Dynamic Programming Slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved.
Chapter 6 Dynamic Programming
Weighted Minimum Edit Distance
Space Efficient Alignment Algorithms and Affine Gap Penalties Dr. Nancy Warter-Perez.
Instructor Neelima Gupta Instructor: Ms. Neelima Gupta.
CS38 Introduction to Algorithms Lecture 10 May 1, 2014.
1 Chapter 6 Dynamic Programming Slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved.
CSCI-256 Data Structures & Algorithm Analysis Lecture Note: Some slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved. 21.
9/27/10 A. Smith; based on slides by E. Demaine, C. Leiserson, S. Raskhodnikova, K. Wayne Adam Smith Algorithm Design and Analysis L ECTURE 16 Dynamic.
CS502: Algorithms in Computational Biology
Chapter 6 Dynamic Programming
Dynamic programming 叶德仕
Definition of Minimum Edit Distance
Prepared by Chen & Po-Chuan 2016/03/29
Chapter 6 Dynamic Programming
Dynamic Programming General Idea
Chapter 6 Dynamic Programming
Chapter 6 Dynamic Programming
Memory Efficient Longest Common Subsequence
CSE 589 Applied Algorithms Spring 1999
Chapter 6 Dynamic Programming
Chapter 6 Dynamic Programming
Dynamic Programming General Idea
Bioinformatics Algorithms and Data Structures
Richard Anderson Lecture 19 Memory Efficient Dynamic Programming
Space-Saving Strategies for Computing Δ-points
Richard Anderson Lecture 19 Longest Common Subsequence
Space-Saving Strategies for Analyzing Biomolecular Sequences
Space-Saving Strategies for Computing Δ-points
Asst. Prof. Dr. İlker Kocabaş
Data Structures and Algorithm Analysis Lecture 15
Dynamic Programming.
Richard Anderson Lecture 20 Space Efficient LCS
Presentation transcript:

Sequence Alignment Tanya Berger-Wolf CS502: Algorithms in Computational Biology January 25, 2011

Comparing Genes in Two Genomes Small islands of similarity corresponding to similarities between exons Such comparisons are quite common in biology research

String Similarity How similar are two strings? –ocurrance –occurrence ocurrance ccurrenceo - ocurrnce ccurrnceo --a e- ocurrance ccurrenceo - 6 mismatches, 1 gap 1 mismatch, 1 gap 0 mismatches, 3 gaps

Applications. –Basis for Unix diff. –Speech recognition. –Computational biology. Edit distance. [Levenshtein 1966, Needleman-Wunsch 1970] –Gap penalty  ; mismatch penalty  pq. –Cost = sum of gap and mismatch penalties. 2  +  CA CGACCTACCT CTGACTACAT TGACCTACCT CTGACTACAT - T C C C  TC +  GT +  AG + 2  CA - Edit Distance

Goal: Given two strings X = x 1 x 2... x m and Y = y 1 y 2... y n find alignment of minimum cost. Def. An alignment M is a set of ordered pairs x i -y j such that each item occurs in at most one pair and no crossings. Def. The pair x i -y j and x i' -y j' cross if i j'. Ex: CTACCG vs. TACATG. Sol: M = x 2 -y 1, x 3 -y 2, x 4 -y 3, x 5 -y 4, x 6 -y 6. Sequence Alignment CTACC- TACAT- G G y1y1 y2y2 y3y3 y4y4 y5y5 y6y6 x2x2 x3x3 x4x4 x5x5 x1x1 x6x6

Dynamic Programming History Bellman. Pioneered the systematic study of dynamic programming in the 1950s. Etymology. –Dynamic programming = planning over time. –Secretary of Defense was hostile to mathematical research. –Bellman sought an impressive name to avoid confrontation. "it's impossible to use dynamic in a pejorative sense" "something not even a Congressman could object to" Reference: Bellman, R. E. Eye of the Hurricane, An Autobiography.

Dynamic Programming Applications Areas. –Bioinformatics. –Control theory. –Information theory. –Operations research. –Computer science: theory, graphics, AI, systems, …. Some famous dynamic programming algorithms. –Viterbi for hidden Markov models. –Unix diff for comparing two files. –Smith-Waterman for sequence alignment. –Bellman-Ford for shortest path routing in networks. –Cocke-Kasami-Younger for parsing context free grammars.

Sequence Alignment: Problem Structure Def. OPT(i, j) = min cost of aligning strings x 1 x 2... x i and y 1 y 2... y j. –Case 1: OPT matches x i -y j. pay mismatch for x i -y j + min cost of aligning two strings x 1 x 2... x i-1 and y 1 y 2... y j-1 –Case 2a: OPT leaves x i unmatched. pay gap for x i and min cost of aligning x 1 x 2... x i-1 and y 1 y 2... y j –Case 2b: OPT leaves y j unmatched. pay gap for y j and min cost of aligning x 1 x 2... x i and y 1 y 2... y j-1

Sequence Alignment: Algorithm Analysis.  (mn) time and space. English words or sentences: m, n  10. Computational biology: m = n = 100, billions ops OK, but 10GB array? Sequence-Alignment(m, n, x 1 x 2...x m, y 1 y 2...y n, ,  ) { for i = 0 to m M[0, i] = i  for j = 0 to n M[j, 0] = j  for i = 1 to m for j = 1 to n M[i, j] = min(  [x i, y j ] + M[i-1, j-1],  + M[i-1, j],  + M[i, j-1]) return M[m, n] }

Sequence Alignment: Algorithm x m + 1 y n + 1 CTACC- TACAT- G G y1y1 y2y2 y3y3 y4y4 y5y5 y6y6 x2x2 x3x3 x4x4 x5x5 x1x1 x6x6  T A C C  T A C A T G C G Gap penalty = Mismatch penalty = 1

6.7 Sequence Alignment in Linear Space

Sequence Alignment: Linear Space Q. Can we avoid using quadratic space? Easy. Optimal value in O(m + n) space and O(mn) time. –Compute OPT(i, ) from OPT(i-1, ). –No longer a simple way to recover alignment itself. Theorem. [Hirschberg 1975] Optimal alignment in O(m + n) space and O(mn) time. –Clever combination of divide-and-conquer and dynamic programming. –Inspired by idea of Savitch from complexity theory.

Edit distance graph. –Let f(i, j) be shortest path from (0,0) to (i, j). –Observation: f(i, j) = OPT(i, j). Sequence Alignment: Linear Space i-j m-n x1x1 x2x2 y1y1 x3x3 y2y2 y3y3 y4y4 y5y5 y6y6   0-0  

Edit distance graph. –Let f(i, j) be shortest path from (0,0) to (i, j). –Can compute f (, j) for any j in O(mn) time and O(m + n) space. Sequence Alignment: Linear Space i-j m-n x1x1 x2x2 y1y1 x3x3 y2y2 y3y3 y4y4 y5y5 y6y6   0-0 j

Edit distance graph. –Let g(i, j) be shortest path from (i, j) to (m, n). –Can compute by reversing the edge orientations and inverting the roles of (0, 0) and (m, n) Sequence Alignment: Linear Space i-j m-n x1x1 x2x2 y1y1 x3x3 y2y2 y3y3 y4y4 y5y5 y6y6   0-0  

Edit distance graph. –Let g(i, j) be shortest path from (i, j) to (m, n). –Can compute g(, j) for any j in O(mn) time and O(m + n) space. Sequence Alignment: Linear Space i-j m-n x1x1 x2x2 y1y1 x3x3 y2y2 y3y3 y4y4 y5y5 y6y6   0-0 j

Observation 1. The cost of the shortest path that uses (i, j) is f(i, j) + g(i, j). Sequence Alignment: Linear Space i-j m-n x1x1 x2x2 y1y1 x3x3 y2y2 y3y3 y4y4 y5y5 y6y6   0-0

Observation 2. let q be an index that minimizes f(q, n/2) + g(q, n/2). Then, the shortest path from (0, 0) to (m, n) uses (q, n/2). Sequence Alignment: Linear Space i-j m-n x1x1 x2x2 y1y1 x3x3 y2y2 y3y3 y4y4 y5y5 y6y6   0-0 n / 2 q

Divide: find index q that minimizes f(q, n/2) + g(q, n/2) using DP. –Align x q and y n/2. Conquer: recursively compute optimal alignment in each piece. Sequence Alignment: Linear Space i-j x1x1 x2x2 y1y1 x3x3 y2y2 y3y3 y4y4 y5y5 y6y6   0-0 q n / 2 m-n

Theorem. Let T(m, n) = max running time of algorithm on strings of length at most m and n. T(m, n) = O(mn log n). Remark. Analysis is not tight because two sub-problems are of size (q, n/2) and (m - q, n/2). In next slide, we save log n factor. Sequence Alignment: Running Time Analysis Warmup

Theorem. Let T(m, n) = max running time of algorithm on strings of length m and n. T(m, n) = O(mn). Pf. (by induction on n) –O(mn) time to compute f(, n/2) and g (, n/2) and find index q. –T(q, n/2) + T(m - q, n/2) time for two recursive calls. –Choose constant c so that: –Base cases: m = 2 or n = 2. –Inductive hypothesis: T(m, n)  2cmn. Sequence Alignment: Running Time Analysis