Core String Edits, Alignments, and Dynamic Programming.


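The edit distance between two strings
Distance is a measure of the difference between two strings.
Edit operations:
▫Insertion (I): insert a character into the first string
▫Deletion (D): delete a character from the first string
▫Replacement (R): replace a character in the first string with the character in the second string
▫Match (M): the characters in the two strings agree, so no operation is needed
For example, the transcript T below transforms S1 = vintner into S2 = writers:
T:  R I M D M D M M I
S1: v   i n t n e r
S2: w r i   t   e r s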

The edit distance between two strings
Definition: A string over the alphabet {I, D, R, M} that describes a transformation of one string into another is called an edit transcript, or transcript for short, of the two strings.
Definition: The edit distance between two strings is the minimum number of edit operations (insertions, deletions, and substitutions) needed to transform the first string into the second. Matches are not counted.
Definition: An optimal transcript is an edit transcript that uses the minimum number of edit operations. There may be more than one optimal transcript.
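To make the definition of an edit transcript concrete, here is a minimal Python sketch that replays a transcript on the first string. The function names `apply_transcript` and `transcript_cost` are illustrative and do not come from the slides; inserted and replacement characters are read from the second string.

```python
def apply_transcript(transcript: str, s1: str, s2: str) -> str:
    """Replay an edit transcript over {I, D, R, M} on s1 and return the result.

    Inserted and replacement characters are taken from s2 (an assumption of
    this sketch; the transcript itself only records the operation types).
    """
    out = []
    i = j = 0  # next unread positions in s1 and s2
    for op in transcript:
        if op == "M":        # match: copy the character of s1 unchanged
            out.append(s1[i])
            i += 1
            j += 1
        elif op == "R":      # replacement: take the character of s2 instead
            out.append(s2[j])
            i += 1
            j += 1
        elif op == "I":      # insertion: add a character of s2, s1 does not advance
            out.append(s2[j])
            j += 1
        elif op == "D":      # deletion: skip a character of s1
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    if i != len(s1) or j != len(s2):
        raise ValueError("transcript does not consume both strings")
    return "".join(out)


def transcript_cost(transcript: str) -> int:
    """Number of edit operations in a transcript; matches are not counted."""
    return sum(op != "M" for op in transcript)
```

Replaying RIMDMDMMI on vintner with writers as the second string reproduces writers, and the transcript uses five edit operations.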

String alignment
Definition: A (global) alignment of two strings S1 and S2 is obtained by first inserting chosen spaces (or dashes), either into or at the ends of S1 and S2, and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string.
Definition: By the term global we mean that for each string, the entire string is involved in the alignment.
For example, an alignment of qacdbd and qawxb:
q a c - d b d
q a w x - b -

Alignment versus edit transcript
From the mathematical standpoint, an alignment and an edit transcript are equivalent.
Alignment | Edit transcript
Mismatch | Replacement
Space contained in the first string | An insertion of the opposing character into the first string
Space contained in the second string | A deletion of the opposing character from the first string
Only displays a relationship between the two strings | Emphasizes the putative mutational events that transform one string to another
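The correspondence between alignments and transcripts can be sketched as a column-by-column conversion. The helper name `alignment_to_transcript` and the default space symbol "-" are assumptions made for this illustration.

```python
def alignment_to_transcript(row1: str, row2: str, space: str = "-") -> str:
    """Convert the two rows of a global alignment into an edit transcript.

    row1 is the first string with spaces inserted, row2 the second string.
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    ops = []
    for a, b in zip(row1, row2):
        if a == space and b == space:
            raise ValueError("a space may not be aligned with a space")
        if a == space:       # space in the first string: insertion
            ops.append("I")
        elif b == space:     # space in the second string: deletion
            ops.append("D")
        elif a == b:         # equal characters: match
            ops.append("M")
        else:                # different characters: replacement
            ops.append("R")
    return "".join(ops)
```

For instance, the rows `v-intner-` and `wri-t-ers` give back the transcript RIMDMDMMI.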

Dynamic programming calculation of edit distance
Definition: For two strings S1 and S2, D(i,j) is defined to be the edit distance of S1[1..i] and S2[1..j]. That is, D(i,j) denotes the minimum number of edit operations needed to transform the first i characters of S1 into the first j characters of S2.
The dynamic programming approach has three essential components:
▫Recurrence relation
▫Tabular computation
▫Traceback

The recurrence relation
Base conditions:
▫D(i,0) = i, because the only way to transform the first i characters of S1 to zero characters of S2 is to delete all i characters of S1.
▫D(0,j) = j, because j characters must be inserted to convert zero characters of S1 to j characters of S2.
If S1 has n letters and S2 has m letters, the edit distance is precisely the value D(n,m).
If i > 0 and j > 0:
▫D(i,j) = min[D(i,j-1)+1, D(i-1,j)+1, D(i-1,j-1)+t(i,j)]
where t(i,j) = 1 if S1(i) ≠ S2(j) and t(i,j) = 0 if S1(i) = S2(j).

Correctness of the general recurrence
Lemma 1: The value of D(i,j) must be D(i,j-1)+1, D(i-1,j)+1, or D(i-1,j-1)+t(i,j).
Proof: Consider an edit transcript for the transformation of S1[1..i] to S2[1..j] that uses the minimum number of edit operations, and focus on the last symbol in that transcript. The last symbol can be I, D, R, or M.
▫If it is I, the last operation is the insertion of character S2(j) onto the end of the transformed first string. The symbols in the transcript before that I must specify the minimum number of edit operations to transform S1[1..i] to S2[1..j-1]; otherwise the whole transformation would use more than the minimum number of operations. That transformation takes D(i,j-1) edit operations, so D(i,j) = D(i,j-1)+1.
▫If it is D, the last operation is the deletion of S1(i), and the symbols to the left of that D must specify the minimum number of edit operations to transform S1[1..i-1] to S2[1..j]. That transformation takes D(i-1,j) edit operations, so D(i,j) = D(i-1,j)+1.
▫If it is R, the last operation replaces S1(i) with S2(j), and the symbols to the left of that R specify the minimum number of edit operations to transform S1[1..i-1] to S2[1..j-1]. That transformation takes D(i-1,j-1) edit operations, so D(i,j) = D(i-1,j-1)+1.
▫If it is M, then S1(i) = S2(j) and D(i,j) = D(i-1,j-1).
The last two cases combine: if the last symbol is R or M, then D(i,j) = D(i-1,j-1)+t(i,j).

Correctness of the general recurrence
Lemma 2: D(i,j) <= min[D(i,j-1)+1, D(i-1,j)+1, D(i-1,j-1)+t(i,j)]
Proof: It suffices to exhibit, for each of the three terms, a transformation of S1[1..i] into S2[1..j] that uses that number of operations.
▫D(i,j) <= D(i,j-1)+1: transform S1[1..i] to S2[1..j-1] with the minimum number of operations and then use one more to insert character S2(j) at the end.
▫D(i,j) <= D(i-1,j)+1: transform S1[1..i-1] to S2[1..j] with the minimum number of operations and then delete character S1(i).
▫D(i,j) <= D(i-1,j-1)+t(i,j): transform S1[1..i-1] to S2[1..j-1] with the minimum number of operations and then replace (R) or match (M) S1(i) with S2(j).

Correctness of the general recurrence
Lemma 3: When both i and j are strictly positive, D(i,j) = min[D(i,j-1)+1, D(i-1,j)+1, D(i-1,j-1)+t(i,j)]
Proof: Lemma 1 says the value of D(i,j) must be D(i,j-1)+1, D(i-1,j)+1, or D(i-1,j-1)+t(i,j), and Lemma 2 says D(i,j) is at most the smallest of these three values. Hence D(i,j) = min[D(i,j-1)+1, D(i-1,j)+1, D(i-1,j-1)+t(i,j)].

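Tabular computation of edit distance
A top-down recursive approach to evaluating D(n,m) is simple to program but extremely inefficient for large values of n and m, because the number of recursive calls grows exponentially. Yet there are only (n+1) x (m+1) distinct combinations of i and j, so only that many distinct values D(i,j) need to be computed.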

Bottom-up computation

Time analysis
Theorem 2: The dynamic programming table for computing the edit distance between a string of length n and a string of length m can be filled in with O(nm) work. Hence, using dynamic programming, the edit distance D(n,m) can be computed in O(nm) time.
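The bottom-up tabular computation can be sketched in a few lines of Python. This is a minimal illustration of the recurrence for D(i,j) with unit operation costs; the function names are hypothetical, and Python's 0-based strings mean S1(i) is `s1[i - 1]`.

```python
def edit_distance_table(s1: str, s2: str) -> list:
    """Fill the (n+1) x (m+1) table D row by row, using O(nm) work."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # delete all i characters of S1
    for j in range(m + 1):
        D[0][j] = j                      # insert all j characters of S2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i][j - 1] + 1,       # last operation is I
                          D[i - 1][j] + 1,       # last operation is D
                          D[i - 1][j - 1] + t)   # last operation is R or M
    return D


def edit_distance(s1: str, s2: str) -> int:
    """Edit distance D(n,m), read from the bottom-right cell of the table."""
    return edit_distance_table(s1, s2)[len(s1)][len(s2)]
```

For vintner and writers this gives an edit distance of 5.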

The traceback

The pointers allow easy recovery of an optimal edit transcript. Follow any path of pointers from cell (n,m) to cell (0,0).
Pointer | Alignment | Edit transcript
Diagonal, from (i,j) to (i-1,j-1) | Match | Match
Diagonal, from (i,j) to (i-1,j-1) | Mismatch | Replacement
Horizontal, from (i,j) to (i,j-1) | Space contained in the first string | An insertion of the opposing character into the first string
Vertical, from (i,j) to (i-1,j) | Space contained in the second string | A deletion of the opposing character from the first string

The traceback
Theorem 3: Once the dynamic programming table with pointers has been computed, an optimal edit transcript can be found in O(n+m) time.

The pointers represent all optimal edit transcripts
Theorem 4: Any path from (n,m) to (0,0) following pointers established during the computation of D(i,j) specifies an edit transcript with the minimum number of edit operations. Conversely, any optimal edit transcript is specified by such a path. Moreover, since a path describes only one transcript, the correspondence between paths and optimal transcripts is one-to-one.
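The traceback can be sketched without storing pointers explicitly: at each cell, the sketch below re-checks which neighbor attains the minimum, which is equivalent to following a stored pointer. The name `optimal_transcript` and the tie-breaking order (diagonal, then horizontal, then vertical) are choices made for this illustration, so it returns one of possibly several optimal transcripts.

```python
def optimal_transcript(s1: str, s2: str) -> str:
    """Return one optimal edit transcript over {I, D, R, M} from s1 to s2."""
    n, m = len(s1), len(s2)
    # Fill the edit distance table D bottom-up.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i][j - 1] + 1, D[i - 1][j] + 1, D[i - 1][j - 1] + t)

    # Walk from cell (n, m) back to cell (0, 0): O(n + m) steps.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        t = 1 if i > 0 and j > 0 and s1[i - 1] != s2[j - 1] else 0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + t:
            ops.append("R" if t else "M")    # diagonal pointer
            i -= 1
            j -= 1
        elif j > 0 and D[i][j] == D[i][j - 1] + 1:
            ops.append("I")                  # horizontal pointer
            j -= 1
        else:
            ops.append("D")                  # vertical pointer
            i -= 1
    return "".join(reversed(ops))
```

Enumerating every optimal transcript would instead branch at each cell where more than one neighbor attains the minimum.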

Edit graphs
Definition: Given two strings S1 and S2 of lengths n and m, respectively, a weighted edit graph has (n+1) x (m+1) nodes, each labeled with a distinct pair (i,j) (0 <= i <= n, 0 <= j <= m). The specific edges and their edge weights depend on the specific string problem.

Edit graphs

Theorem 5: An edit transcript for S1, S2 has the minimum number of edit operations if and only if it corresponds to a shortest path from (0,0) to (n,m) in the edit graph.
Corollary 1: The set of all shortest paths from (0,0) to (n,m) in the edit graph exactly specifies the set of all optimal edit transcripts of S1 to S2. Equivalently, it specifies all the optimal (minimum weight) alignments of S1 and S2.

Weighted edit distance
Operation weights
Definition: With arbitrary operation weights, the operation-weight edit distance problem is to find an edit transcript that transforms string S1 into S2 with the minimum total operation weight.

Computing operation-weight edit distance
D(i,j) denotes the minimum total weight for edit operations transforming S1[1..i] to S2[1..j].
▫d: weight for insertion or deletion
▫r: weight for replacement
▫e: weight for match
Base conditions:
▫D(i,0) = i x d
▫D(0,j) = j x d
General recurrence:
▫D(i,j) = min[D(i,j-1)+d, D(i-1,j)+d, D(i-1,j-1)+t(i,j)]
where t(i,j) = e if S1(i) = S2(j) and t(i,j) = r otherwise.
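The operation-weight recurrence differs from the unit-cost one only in the constants. A minimal Python sketch follows; the function name and the default weights are assumptions for illustration. It keeps only two rows of the table, which is enough when just the value D(n,m) is needed.

```python
def weighted_edit_distance(s1: str, s2: str,
                           d: float = 1, r: float = 1, e: float = 0) -> float:
    """Operation-weight edit distance.

    d: weight of an insertion or deletion, r: weight of a replacement,
    e: weight of a match (hypothetical defaults reproduce plain edit distance).
    """
    n, m = len(s1), len(s2)
    prev = [j * d for j in range(m + 1)]          # row 0: D(0,j) = j x d
    for i in range(1, n + 1):
        cur = [i * d] + [0] * m                   # column 0: D(i,0) = i x d
        for j in range(1, m + 1):
            t = e if s1[i - 1] == s2[j - 1] else r
            cur[j] = min(cur[j - 1] + d,          # insertion
                         prev[j] + d,             # deletion
                         prev[j - 1] + t)         # match or replacement
        prev = cur
    return prev[m]
```

When r exceeds 2d, a replacement is never cheaper than a deletion followed by an insertion, so an optimal transcript can avoid replacements entirely.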

Alphabet-weight edit distance
A further generalization of edit distance allows the weight or score of a replacement to depend on exactly which character in the alphabet is being removed and which is being added.

String similarity
String similarity is a way to measure the relatedness of two strings. Here the language of alignment is more convenient than that of edit transcripts.
Definition: Let Σ be the alphabet used for strings S1 and S2, and let Σ' be Σ with the added character "-" denoting a space. Then, for any two characters x, y in Σ', s(x,y) denotes the value (or score) obtained by aligning character x against character y.
Definition: For a given alignment A of S1 and S2, let S1' and S2' denote the strings after the chosen insertion of spaces, and let l denote the (equal) length of the two strings S1' and S2' in A. The value of alignment A is defined as the sum over i = 1, ..., l of s(S1'(i), S2'(i)).

String similarity
For example, with Σ = {a,b,c,d}:
Alignment
▫cac-dbd
▫cabbdb-
Total value = 4
s(x,y) is greater than or equal to zero if characters x, y of Σ' match and less than zero if they mismatch, which emphasizes matches between the two strings.
Definition: Given a pairwise scoring matrix over the alphabet Σ', the similarity of two strings S1 and S2 is defined as the value of the alignment A of S1 and S2 that maximizes total alignment value. This is also called the optimal alignment value of S1 and S2.
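The value of a given alignment is just a column-by-column sum of scores, as the following Python sketch shows. The scoring function `simple_score` (match 2, mismatch -2, space -1) is an assumption made for this illustration and is not the pairwise scoring matrix used on the slide.

```python
SPACE = "-"


def simple_score(x: str, y: str) -> int:
    """Hypothetical scoring scheme: match 2, mismatch -2, space -1."""
    if x == SPACE or y == SPACE:
        return -1
    return 2 if x == y else -2


def alignment_value(row1: str, row2: str, score=simple_score) -> int:
    """Sum s(S1'(i), S2'(i)) over all columns i of the alignment."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    return sum(score(x, y) for x, y in zip(row1, row2))
```

Under this assumed scheme, the alignment of cac-dbd against cabbdb- has four matches, one mismatch, and two spaces, for a value of 8 - 2 - 2 = 4.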

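Computing similarity
Definition: V(i,j) is defined as the value of the optimal alignment of prefixes S1[1..i] and S2[1..j].
Base conditions:
▫V(0,j) = sum over k = 1, ..., j of s(_, S2(k))
▫V(i,0) = sum over k = 1, ..., i of s(S1(k), _)
General recurrence, for i > 0 and j > 0:
▫V(i,j) = max[V(i-1,j-1) + s(S1(i),S2(j)), V(i-1,j) + s(S1(i),_), V(i,j-1) + s(_,S2(j))]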
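A minimal Python sketch of the similarity computation is shown below. It assumes the standard base conditions (row and column zero align a prefix entirely against spaces) and the three-way maximum over the neighboring cells; the scoring function `simple_score` is a hypothetical scheme chosen for the example.

```python
SPACE = "-"


def simple_score(x: str, y: str) -> int:
    """Hypothetical scoring scheme: match 2, mismatch -2, space -1."""
    if x == SPACE or y == SPACE:
        return -1
    return 2 if x == y else -2


def similarity(s1: str, s2: str, score=simple_score) -> int:
    """Optimal global alignment value V(n,m), computed in O(nm) time."""
    n, m = len(s1), len(s2)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # S1[1..i] against spaces
        V[i][0] = V[i - 1][0] + score(s1[i - 1], SPACE)
    for j in range(1, m + 1):                      # spaces against S2[1..j]
        V[0][j] = V[0][j - 1] + score(SPACE, s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),  # S1(i) opposite S2(j)
                V[i - 1][j] + score(s1[i - 1], SPACE),          # S1(i) opposite a space
                V[i][j - 1] + score(SPACE, s2[j - 1]),          # S2(j) opposite a space
            )
    return V[n][m]
```

With this scheme, `similarity("axabcs", "axbacs")` evaluates to 8: five matches and two spaces.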

Special cases of similarity
Definition: In a string S, a subsequence is defined as a subset of the characters of S arranged in their original "relative" order. More formally, a subsequence of a string S of length n is specified by a list of indices i1 < i2 < i3 < ... < ik, for some k <= n. The subsequence specified by this list of indices is the string S(i1) S(i2) S(i3) ... S(ik). A subsequence need not consist of contiguous characters in S, whereas the characters of a substring must be contiguous.
Definition: Given two strings S1 and S2, a common subsequence is a subsequence that appears both in S1 and S2. The longest common subsequence problem is to find a longest common subsequence (lcs) of S1 and S2.
Theorem: With a scoring scheme that scores a one for each match and a zero for each mismatch or space, the matched characters in an alignment of maximum value form a longest common subsequence. An lcs can be computed in O(nm) time.
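The theorem suggests computing an lcs with the similarity recurrence specialized to a score of one per match and zero otherwise. The Python sketch below uses that specialization in its usual simplified form (a match always takes the diagonal) and then traces back to recover the characters; the function name is illustrative.

```python
def longest_common_subsequence(s1: str, s2: str) -> str:
    """Return one longest common subsequence of s1 and s2 in O(nm) time."""
    n, m = len(s1), len(s2)
    # L[i][j] = length of an lcs of S1[1..i] and S2[1..j]
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1            # match scores one
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])  # mismatch or space scores zero

    # Trace back from (n, m), collecting the matched characters.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For vintner and writers the only common subsequence of length four is iter, so that is what the sketch returns.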

Alignment graphs for similarity
The computation of similarity can be viewed as a path problem on a directed acyclic graph called an alignment graph.
Alignment graph:
▫Weights on the edges are the specific values for aligning a specific pair of characters or a character against a space.
▫Start node: the node associated with cell (0,0)
▫Destination node: the node associated with cell (n,m)
▫An optimal alignment comes from the longest start-to-destination path rather than from the shortest path.
▫The longest path can be found in O(nm) time.

End-space free variant
A variant of string alignment in which any spaces at the end or the beginning of the alignment contribute a weight of zero, no matter what weight other spaces contribute.

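Approximate occurrences of P in T
Definition: Given a parameter δ, a substring T' of T is said to be an approximate occurrence of P if and only if the optimal alignment of P to T' has value at least δ.
Here P has length n, and V is computed by the recurrences for similarity with P in the role of S1 and T in the role of S2, except that the base condition for row zero is V(0,j) = 0 for all j, so that an occurrence may begin anywhere in T.
Theorem: There is an approximate occurrence of P in T ending at position j of T if and only if V(n,j) >= δ. Moreover, T[k..j] is an approximate occurrence of P in T if and only if V(n,j) >= δ and there is a path of backpointers from cell (n,j) to cell (0,k).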
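The first part of the theorem translates into a short scan of the last row of the table. The Python sketch below assumes the base condition V(0,j) = 0 for all j (spaces before the occurrence in T are free) and a hypothetical scoring scheme `simple_score`; it reports the end positions j, 1-based as in the slides, and keeps only two rows of the table.

```python
SPACE = "-"


def simple_score(x: str, y: str) -> int:
    """Hypothetical scoring scheme: match 2, mismatch -2, space -1."""
    if x == SPACE or y == SPACE:
        return -1
    return 2 if x == y else -2


def approximate_occurrence_ends(p: str, t: str, delta: int,
                                score=simple_score) -> list:
    """Positions j of T (1-based) with V(n,j) >= delta."""
    n, m = len(p), len(t)
    prev = [0] * (m + 1)                 # row 0: V(0,j) = 0, free start in T
    for i in range(1, n + 1):
        cur = [prev[0] + score(p[i - 1], SPACE)] + [0] * m   # V(i,0)
        for j in range(1, m + 1):
            cur[j] = max(
                prev[j - 1] + score(p[i - 1], t[j - 1]),
                prev[j] + score(p[i - 1], SPACE),
                cur[j - 1] + score(SPACE, t[j - 1]),
            )
        prev = cur
    return [j for j in range(1, m + 1) if prev[j] >= delta]
```

With P = abc and T = xxabcxx, a threshold of 6 admits only the exact occurrence ending at position 5, while a threshold of 5 also admits an occurrence ending at position 6 that includes one space.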

Local alignment: finding substrings of high similarity
Local alignment problem: Given two strings S1 and S2, find substrings α and β of S1 and S2, respectively, whose similarity (optimal global alignment value) is maximum over all pairs of substrings from S1 and S2. We use v* to denote the value of an optimal solution to the local alignment problem.
For example:
▫S1 = pqraxabcstvq, S2 = xyaxbacsll
▫Match: 2, mismatch: -2, space: -1
▫Substrings: α = axabcs, β = axbacs
▫Alignment:
a x a b _ c s
a x _ b a c s
▫Value of this optimal local alignment: v* = 8
Local alignment is defined in terms of similarity, which maximizes an objective function, because this is more likely to find meaningful regions of high similarity.

Computing local alignment
Definition: Given a pair of indices i <= n and j <= m, the local suffix alignment problem is to find a (possibly empty) suffix α of S1[1..i] and a (possibly empty) suffix β of S2[1..j] such that V(α, β) is maximum over all pairs of suffixes of S1[1..i] and S2[1..j]. We use v(i,j) to denote the value of the optimal local suffix alignment for the given index pair i, j.
Theorem: If i', j' is an index pair maximizing v(i,j) over all i, j pairs, then a pair of substrings solving the local suffix alignment problem for i', j' also solves the local alignment problem.

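How to solve the local suffix alignment problem
Base conditions: v(i,0) = 0 and v(0,j) = 0, since the empty suffixes may always be chosen.
Theorem: For i > 0 and j > 0, the proper recurrence for v(i,j) is
▫v(i,j) = max[0, v(i-1,j-1) + s(S1(i),S2(j)), v(i-1,j) + s(S1(i),_), v(i,j-1) + s(_,S2(j))]
Proof: Let α and β be the suffixes of S1[1..i] and S2[1..j] whose global alignment establishes the optimal local suffix alignment for i, j. Each term of the maximum covers one case:
▫0: α and β are permitted to be empty suffixes of S1[1..i] and S2[1..j].
▫v(i-1,j-1) + s(S1(i),S2(j)): S1(i) is aligned with S2(j) in the optimal local suffix alignment for i, j. This column contributes s(S1(i),S2(j)), and the remainder of v(i,j) is determined by the local suffix alignment for indices i-1, j-1.
▫v(i-1,j) + s(S1(i),_): S1(i) is aligned with a space.
▫v(i,j-1) + s(_,S2(j)): S2(j) is aligned with a space.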

Time analysis
Theorem: For two strings S1 and S2 of lengths n and m, the local alignment problem can be solved in O(nm) time, the same time as for global alignment.
Theorem: All optimal local alignments of two strings are represented in the dynamic programming table for v(i,j) and can be found by tracing any pointers back from any cell with value v*.
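The local alignment computation can be sketched as follows. This is a minimal Python illustration of the recurrence for v(i,j) with the extra zero term, followed by a traceback from a cell of maximum value until a zero cell is reached. The function name, the scoring helper `simple_score`, and the choice of the first maximal cell in row-major order are assumptions of the sketch; it returns one optimal pair of substrings, not all of them.

```python
SPACE = "-"


def simple_score(x: str, y: str) -> int:
    """Hypothetical scoring scheme: match 2, mismatch -2, space -1."""
    if x == SPACE or y == SPACE:
        return -1
    return 2 if x == y else -2


def local_alignment(s1: str, s2: str, score=simple_score):
    """Return (v*, alpha, beta): the optimal local alignment value and one
    pair of substrings of s1 and s2 that attains it."""
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]    # v(i,0) = v(0,j) = 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(
                0,                                              # empty suffixes
                v[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                v[i - 1][j] + score(s1[i - 1], SPACE),
                v[i][j - 1] + score(SPACE, s2[j - 1]),
            )
            if v[i][j] > best:
                best, bi, bj = v[i][j], i, j

    # Trace back from the best cell until a cell of value zero is reached.
    i, j = bi, bj
    while i > 0 and j > 0 and v[i][j] > 0:
        if v[i][j] == v[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]):
            i -= 1
            j -= 1
        elif v[i][j] == v[i - 1][j] + score(s1[i - 1], SPACE):
            i -= 1
        else:
            j -= 1
    return best, s1[i:bi], s2[j:bj]
```

On S1 = pqraxabcstvq and S2 = xyaxbacsll with these scores, the sketch returns the value 8 together with the substrings axabcs and axbacs.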

Gaps
Definition: A gap is any maximal, consecutive run of spaces in a single string of a given alignment. A gap may begin before the start of S, in which case it is bordered on the right by the first character of S, or it may begin after the end of S, in which case it is bordered on the left by the last character of S. Otherwise, a gap must be bordered on both sides by characters of S.
For example:
c t t t a a c _ _ a _ a c
c _ _ _ c a c c c a t _ c

Why gaps?
A gap in string S1 opposite substring α in string S2 corresponds to either a deletion of α from S1 or to an insertion of α into S2.
Gaps are important in many biological applications.
Common gaps in pairs of aligned strings can sometimes be the key features used to deduce the overall evolutionary history of a set of strings.
An alignment of two strings is intended to reflect the cost of mutational events needed to transform one string to another.

Choices for gap weights
Types of gap weights:
▫Constant
▫Affine
▫Convex
▫Arbitrary
Constant gap weight: each individual space is free, and each gap is given a weight of Wg independent of the number of spaces in the gap.
▫Find an alignment A to maximize [Wm(#matches) - Wms(#mismatches) - Wg(#gaps)]
Affine gap weight: a generalization of the constant gap weight model that adds a weight Ws for each space in the gap. The weight contributed by a single gap of length q is given by the affine function Wg + qWs.
▫Find an alignment A to maximize [Wm(#matches) - Wms(#mismatches) - Wg(#gaps) - Ws(#spaces)]
Convex gap weight: a gap weight function where each additional space in a gap contributes less to the gap weight than the preceding space.
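The affine objective can be evaluated for a given alignment by counting matches, mismatches, gaps, and spaces. The Python sketch below does exactly that; the function names, the "_" space symbol, and the weights used in the usage note are assumptions for illustration. Setting ws to zero gives the constant gap weight model.

```python
from itertools import groupby


def gap_lengths(row: str, space: str = "_") -> list:
    """Lengths of the maximal runs of spaces (the gaps) in one alignment row."""
    return [sum(1 for _ in run) for ch, run in groupby(row) if ch == space]


def affine_alignment_value(row1: str, row2: str,
                           wm: float, wms: float, wg: float, ws: float,
                           space: str = "_") -> float:
    """Wm(#matches) - Wms(#mismatches) - Wg(#gaps) - Ws(#spaces)."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    # Columns with a character in both rows are matches or mismatches.
    pairs = [(a, b) for a, b in zip(row1, row2) if a != space and b != space]
    matches = sum(a == b for a, b in pairs)
    mismatches = len(pairs) - matches
    gaps = gap_lengths(row1, space) + gap_lengths(row2, space)
    return wm * matches - wms * mismatches - wg * len(gaps) - ws * sum(gaps)
```

For the rows `ctttaac__a_ac` and `c___cacccat_c` there are five matches, one mismatch, four gaps, and seven spaces, so with hypothetical weights Wm = 2, Wms = 1, Wg = 3, Ws = 1 the value is 10 - 1 - 12 - 7 = -10.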

Time bounds for gap choices
▫Arbitrary gap weight: O(nm^2 + n^2 m)
▫Convex gap weight: O(nm log m)
▫Affine gap weight: O(nm)

Arbitrary gap weights
To align strings S1 and S2, consider, as usual, the prefixes S1[1..i] of S1 and S2[1..j] of S2. Any alignment of those two prefixes is one of the following three types:
1. Alignments of S1[1..i] and S2[1..j] where character S1(i) is aligned to a character strictly to the left of character S2(j). Therefore, the alignment ends with a gap in S1.
2. Alignments of the two prefixes where S1(i) is aligned strictly to the right of S2(j). Therefore, the alignment ends with a gap in S2.
3. Alignments of the two prefixes where characters S1(i) and S2(j) are aligned opposite each other. This includes both the case that S1(i) = S2(j) and the case that S1(i) ≠ S2(j).

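Arbitrary gap weights
Definition: Define E(i,j) as the maximum value of any alignment of type 1; define F(i,j) as the maximum value of any alignment of type 2; define G(i,j) as the maximum value of any alignment of type 3; and finally define V(i,j) as the maximum value of the three terms E(i,j), F(i,j), G(i,j).
Recurrences for the case of arbitrary gap weights, where w(q) denotes the weight of a gap of length q:
▫V(i,j) = max[E(i,j), F(i,j), G(i,j)]
▫G(i,j) = V(i-1,j-1) + s(S1(i),S2(j))
▫E(i,j) = max over 0 <= k <= j-1 of [V(i,k) - w(j-k)]
▫F(i,j) = max over 0 <= l <= i-1 of [V(l,j) - w(i-l)]
If all spaces are included in the objective function, the optimal alignment value is V(n,m), and the base cases are:
▫V(i,0) = -w(i)
▫V(0,j) = -w(j)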

Arbitrary gap weights
Recurrences for the case of arbitrary gap weights:
▫When end spaces, and hence end gaps, are free, the optimal alignment value is the maximum value over any cell in row n or column m, and the base cases are:
  ▫V(i,0) = 0
  ▫V(0,j) = 0
Time analysis
▫Theorem: Assuming that |S1| = n and |S2| = m, the recurrences can be evaluated in O(nm^2 + n^2 m) time.
Proof: For any fixed row, m(m+1)/2 = Θ(m^2) cells are examined to evaluate all the E values in that row, and for any fixed column, Θ(n^2) cells are examined to evaluate all the F values in that column. Since there are n rows and m columns, the total time is O(nm^2 + n^2 m).

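Affine (and constant) gap weights
Recurrences for the case of affine gap weights:
▫Base conditions when end gaps are included in the alignment value: V(i,0) = -Wg - iWs and V(0,j) = -Wg - jWs
▫Base conditions when end gaps are free: V(i,0) = 0 and V(0,j) = 0
▫The general recurrences, for i > 0 and j > 0, are:
  ▫V(i,j) = max[E(i,j), F(i,j), G(i,j)]
  ▫G(i,j) = V(i-1,j-1) + Wm if S1(i) = S2(j), and G(i,j) = V(i-1,j-1) - Wms if S1(i) ≠ S2(j)
  ▫E(i,j) = max[E(i,j-1), V(i,j-1) - Wg] - Ws
  ▫F(i,j) = max[F(i-1,j), V(i-1,j) - Wg] - Ws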
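A minimal Python sketch of global alignment with affine gap weights follows, for the case where end gaps are included in the alignment value. It uses three tables: V for the best value overall, E for alignments ending with a gap in S1, and F for alignments ending with a gap in S2. One boundary choice is an assumption of this sketch rather than something taken from the slides: E(i,0) for i > 0 and F(0,j) for j > 0 are set to minus infinity, since no alignment of those prefixes can end with the corresponding gap. The function name is illustrative.

```python
NEG_INF = float("-inf")


def affine_gap_similarity(s1: str, s2: str,
                          wm: float, wms: float, wg: float, ws: float) -> float:
    """Optimal global alignment value with affine gap weights, in O(nm) time.

    wm: match weight, wms: mismatch weight, wg: weight per gap,
    ws: weight per space. End gaps are counted.
    """
    n, m = len(s1), len(s2)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a gap in S1
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a gap in S2
    V[0][0] = 0
    for i in range(1, n + 1):                # S1[1..i] opposite one gap in S2
        V[i][0] = F[i][0] = -wg - i * ws
    for j in range(1, m + 1):                # S2[1..j] opposite one gap in S1
        V[0][j] = E[0][j] = -wg - j * ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g = V[i - 1][j - 1] + (wm if s1[i - 1] == s2[j - 1] else -wms)
            # Extend an existing gap (pay Ws) or open a new one (pay Wg + Ws).
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - wg) - ws
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - wg) - ws
            V[i][j] = max(E[i][j], F[i][j], g)
    return V[n][m]
```

With hypothetical weights Wm = 2, Wms = 1, Wg = 3, Ws = 1, aligning ac with abbc gives -1: two matches and a single gap of length two.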

Time analysis
Theorem: The optimal alignment with affine gap weights can be computed in O(nm) time, the same time as for optimal alignment without a gap term.