Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.


Sequence Alignment Variations
– Computing alignments using only O(m) space rather than O(mn) space
– Computing alignments with bounded difference
– Exclusion methods: fast expected running times

1. Linear Space
Hirschberg [1977]
Suppose we only need the maximum similarity value of S and T, without an alignment or transcript.
How can we conserve space?
– Only save row i-1 when computing row i in the table
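The row-by-row idea can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the function name `similarity_value` and the scoring scheme (match +1, mismatch -1, space -1) are assumptions chosen for the example.

```python
def similarity_value(s, t, match=1, mismatch=-1, space=-1):
    """Global similarity V(s, t), keeping only rows i-1 and i of the table."""
    # Row 0: the empty prefix of s aligned with each prefix of t (all spaces).
    prev = [j * space for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * space] + [0] * len(t)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + space, cur[j - 1] + space)
        prev = cur  # row i-1 is no longer needed
    return prev[len(t)]


print(similarity_value("abcdefghi", "abcdeefghi"))
```

Only two lists of length m+1 are alive at any time, so the space is O(m) while the time stays O(nm).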

Illustration: the dynamic programming table, with indices 0, 1, 2, 3, 4, …, n on one axis and …, m on the other

Linear space and an alignment
Assume S has length 2n (S^r and T^r denote the reversals of S and T).
Divide and conquer approach
– Compute the value of the optimal alignment of S[1..n] with all prefixes of T; at the end, store only row n, along with the pointer values of row n
– Compute the value of the optimal alignment of S^r[1..n] with all prefixes of T^r; store only the values in row n
– Find k such that V(S[1..n], T[1..k]) + V(S^r[1..n], T^r[1..m-k]) is maximized over 0 <= k <= m

Illustration (n = 6, m = 18): the split point k sweeps across row n
– k=0, m-k=18: V(S[1..6], T[1..0]) and V(S^r[1..6], T^r[1..18])
– k=1, m-k=17: V(S[1..6], T[1..1]) and V(S^r[1..6], T^r[1..17])
– k=2, m-k=16: V(S[1..6], T[1..2]) and V(S^r[1..6], T^r[1..16])
– k=9, m-k=9: V(S[1..6], T[1..9]) and V(S^r[1..6], T^r[1..9])
– k=18, m-k=0: V(S[1..6], T[1..18]) and V(S^r[1..6], T^r[1..0])

Illustration

Recursive Step
Let k* be the k that maximizes
– V(S[1..n], T[1..k]) + V(S^r[1..n], T^r[1..m-k])
Record all steps on row n, including the one from row n-1 and the one to row n+1.
Recurse on the two subproblems
– S[1..n-1] with T[1..j] where j <= k*
– S^r[1..n] with T^r[1..q] where q <= m-k*
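The divide-and-conquer recursion can be sketched in Python as below. It is a simplified variant rather than the exact bookkeeping on the slide: it splits S in the middle, picks k* from the two last rows, and recurses on S[1..n] with T[1..k*] and on the remaining halves, without recording the steps on row n separately. The names `last_row`, `hirschberg`, `alignment_score`, the `-` space symbol, and the scoring constants are all assumptions made for the illustration.

```python
MATCH, MISMATCH, SPACE = 1, -1, -1  # assumed scoring scheme


def last_row(s, t):
    """Values V(s, t[:j]) for all j, computed with two rows only."""
    prev = [j * SPACE for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * SPACE] + [0] * len(t)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (MATCH if s[i - 1] == t[j - 1] else MISMATCH)
            cur[j] = max(diag, prev[j] + SPACE, cur[j - 1] + SPACE)
        prev = cur
    return prev


def hirschberg(s, t):
    """Return an optimal global alignment as two equal-length strings."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        # Base case, valid for the scores above: pair the single character
        # with an equal character of t if one exists, else with t[0].
        j = max(t.find(s), 0)
        return "-" * j + s + "-" * (len(t) - j - 1), t
    mid, m = len(s) // 2, len(t)
    left = last_row(s[:mid], t)               # V(S[1..n], T[1..k])
    right = last_row(s[mid:][::-1], t[::-1])  # V(S^r[1..n], T^r[1..m-k])
    k_star = max(range(m + 1), key=lambda k: left[k] + right[m - k])
    a1, b1 = hirschberg(s[:mid], t[:k_star])
    a2, b2 = hirschberg(s[mid:], t[k_star:])
    return a1 + a2, b1 + b2


def alignment_score(a, b):
    """Score of an alignment given as two rows (inputs must not contain '-')."""
    return sum(SPACE if "-" in (x, y) else MATCH if x == y else MISMATCH
               for x, y in zip(a, b))


top, bottom = hirschberg("ACGT", "AGT")
print(top)
print(bottom)
```

The slicing creates copies of the substrings, but each recursion level still holds only O(n + m) characters and two table rows, so the overall space remains linear.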

Illustration

Time Required
– cmn time to get this answer so far
– The two subproblems together have at most half the total size of this problem, so the next level of recursion costs at most cmn/2, the level after that at most cmn/4, and so on
– Rest of the solution: cmn/2 + cmn/4 + cmn/8 + cmn/16 + … <= cmn, i.e. at most the same cmn time again
Final result
– Linear space with only twice as much time
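Written out as a worked bound, with c, m, n as on the slide and level i of the recursion assumed to cost at most cmn/2^i in total, the running time is a geometric series:

```latex
T(m, n) \;\le\; \underbrace{cmn}_{\text{top level}}
  \;+\; \sum_{i \ge 1} \frac{cmn}{2^{i}}
  \;\le\; cmn + cmn \;=\; 2\,cmn .
```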

Extending to local alignment
What are the problems?
– We don’t know which substrings of S and T to align, so we won’t know the midpoints
Solution
– Find the end point by computing only values and storing the max value (and its location) along the way
– Find the start point by computing a “reversed” dynamic program using the reverse strings starting at i in S and j in T
– Once the end points are fixed, proceed just as for global alignment
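A rough Python sketch of the two passes that fix the end points is given below. The function names (`local_end`, `anchored_best`, `local_span`) and the scoring scheme are assumptions made for this example; once the span is known, a linear-space global alignment of the two substrings would finish the job.

```python
def local_end(s, t, match=1, mismatch=-1, space=-1):
    """Best local alignment value and the cell (i, j) where it ends."""
    best, end = 0, (0, 0)
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + space, cur[j - 1] + space)
            if cur[j] > best:
                best, end = cur[j], (i, j)
        prev = cur
    return best, end


def anchored_best(s, t, match=1, mismatch=-1, space=-1):
    """Best cell of the table for alignments that must start at (0, 0)."""
    best, end = 0, (0, 0)
    prev = [j * space for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * space] + [0] * len(t)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + space, cur[j - 1] + space)
            if cur[j] > best:
                best, end = cur[j], (i, j)
        prev = cur
    return best, end


def local_span(s, t):
    """Value of the best local alignment and the half-open spans it covers."""
    best, (ei, ej) = local_end(s, t)
    # Reversed dynamic program, anchored at the end point found above.
    _, (li, lj) = anchored_best(s[:ei][::-1], t[:ej][::-1])
    return best, (ei - li, ei), (ej - lj, ej)


print(local_span("xxACGTyy", "zzACGTww"))
```

Both passes keep two rows only, so the end points are found in linear space.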

2. Bounded Difference
Suppose the number of differences between S and T is bounded.
– Typically focus on (unweighted) edit distance
Can we speed things up?
Motivation:
– pages

Problem Definition 1
k-difference global alignment
– Input: strings S and T
– Task: find the best global alignment of S and T containing at most k mismatches and spaces, or report that no such alignment exists

Problem Definition 2
k-difference inexact matching
– Input: strings P and T
– Task: find all ways, if any, to match P in T using at most k character substitutions, insertions, and deletions, or report that no such matches exist
– End spaces in T but not P are free

Earlier Problem Definition
k-mismatch problem
– Input: strings P and T
– Task: find all ways, if any, to match P in T using at most k character substitutions, or report that no such matches exist
– No internal spaces (no insertions or deletions)

Example
Difference between the k-mismatch problem and the k-difference problem
Inputs
– P = abcdefghi
– T = abcdeefghi
Minimum # of mismatches is 4
Minimum # of differences is 1, with 1 space in P after the e
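The two numbers in this example can be checked with a short Python sketch. The function names `min_mismatches` and `min_differences` are made up for the illustration; the second one uses the standard dynamic program with a zero first row, so that end spaces in T are free.

```python
def min_mismatches(p, t):
    """Fewest substitutions over all placements of p in t (no spaces)."""
    n = len(p)
    return min(sum(a != b for a, b in zip(p, t[i:i + n]))
               for i in range(len(t) - n + 1))


def min_differences(p, t):
    """Fewest substitutions, insertions, and deletions to match p in t."""
    prev = [0] * (len(t) + 1)  # row 0 is all zeros: a match may start anywhere
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cur[j] = min(prev[j - 1] + (p[i - 1] != t[j - 1]),
                         prev[j] + 1,
                         cur[j - 1] + 1)
        prev = cur
    return min(prev)  # a match may end anywhere


P, T = "abcdefghi", "abcdeefghi"
print(min_mismatches(P, T), min_differences(P, T))  # 4 1
```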

Solution for k-difference global alignment
– Compute the edit distance of S and T, but only fill in an O(km)-size portion of the table
– Work only with diagonals that are within k of the main diagonal
– If the result in D(n,m) is <= k, then it is the true edit distance and the alignment found is optimal
– If the result in D(n,m) is > k, then the optimal alignment has value > k (though possibly less than D(n,m))

Illustration k=4

Unknown k*
Suppose we don’t know the optimal k* a priori.
Use the doubling trick to guess k*
– Start with k=1
– Then k=2
– Then k=4
– Then k=8
– …
– Final work will be O(k*m)
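The banded computation and the doubling search can be sketched together in Python. This is an illustration under stated assumptions, not the slides' code: `banded_edit_distance` and `edit_distance_doubling` are hypothetical names, band cells are stored in dictionaries for readability rather than speed, and values are capped at k+1 because anything above k is only reported as "more than k".

```python
def banded_edit_distance(s, t, k):
    """Edit distance of s and t if it is at most k, otherwise None."""
    n, m = len(s), len(t)
    if abs(n - m) > k:
        return None  # cell (n, m) lies outside the band
    inf = k + 1      # stands for "more than k"
    prev = {j: j for j in range(min(m, k) + 1)}  # row 0 inside the band
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i
                continue
            best = min(prev.get(j - 1, inf) + (s[i - 1] != t[j - 1]),
                       prev.get(j, inf) + 1,
                       cur.get(j - 1, inf) + 1)
            cur[j] = min(best, inf)
        prev = cur
    d = prev.get(m, inf)
    return d if d <= k else None


def edit_distance_doubling(s, t):
    """Find the edit distance k* without knowing it in advance."""
    k = 1
    while True:
        d = banded_edit_distance(s, t, k)
        if d is not None:
            return d
        k *= 2  # 1, 2, 4, 8, ...


print(edit_distance_doubling("kitten", "sitting"))  # 3
```

Each attempt touches O(km) cells, and the attempts grow geometrically, so the last successful attempt dominates the total work.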

k-difference inexact matching
Solution method
– O(km) time and space solution for the problem
– Space can be reduced to O(m+n) if we only want the end position in T of the match
– Hybrid dynamic programming: use suffix trees with longest common extension together with dynamic programming to solve this problem
Note: the first row of the table will be all 0, to reflect that end spaces in T are free

Definitions
Diagonals are numbered (the main diagonal is diagonal 0)
– 1 to m above the main diagonal
– -1 to -n below the main diagonal
A d-path in the dynamic programming table is a path that starts in row 0 and specifies a total of d mismatches and spaces.
A d-path is farthest reaching in diagonal i if
– it is a d-path that ends in diagonal i, and
– its ending column c is >= the ending column of any other d-path that ends in diagonal i

Illustration

Approach
To compute the farthest-reaching d-path on diagonal i, consider three candidates
– Take the farthest-reaching (d-1)-path on diagonal i+1; move down one square, and then do a longest common extension from that point
– Take the farthest-reaching (d-1)-path on diagonal i-1; move right one square, and then do a longest common extension from that point
– Take the farthest-reaching (d-1)-path on diagonal i; move diagonally one square, and then do a longest common extension from that point
The farthest reaching of these three paths is the farthest-reaching d-path on diagonal i.

Diagonal i+1 (d-1)-path Finding longest 1-path on diagonal 3 using longest 0-paths on diagonals 2, 3, and 4 as a starting point.

Diagonal i-1 (d-1)-path Finding longest 1-path on diagonal 3 using longest 0-paths on diagonals 2, 3, and 4 as a starting point.

Diagonal i (d-1)-path Finding longest 1-path on diagonal 3 using longest 0-paths on diagonals 2, 3, and 4 as a starting point.

High level outline
d = 0
– For i = 0 to m do: find the longest common extension of P starting at P(1) and T starting at T(i+1); this is the 0-path on diagonal i
For d = 1 to k do
– For i = -n to m do: using the farthest-reaching (d-1)-paths on diagonals i-1, i, and i+1, find the farthest-reaching d-path on diagonal i
Any path that reaches row n defines an inexact match of P in T that contains at most k differences.
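The outline can be turned into a small Python sketch. Two simplifications are assumed here and should be read as such: the longest common extension is done by a naive character scan (the `lce` helper) instead of constant-time suffix-tree queries, so the O(km) bound does not apply to this version, and a path is allowed to carry at most d differences rather than exactly d. The names `lce` and `k_difference_ends` are hypothetical.

```python
def lce(p, t, r, c):
    """Length of the longest common prefix of p[r:] and t[c:] (naive scan)."""
    length = 0
    while (r + length < len(p) and c + length < len(t)
           and p[r + length] == t[c + length]):
        length += 1
    return length


def k_difference_ends(p, t, k):
    """End positions j in t (1-based) of matches of p with <= k differences."""
    n, m = len(p), len(t)
    # reach[i] = last row reached by the farthest-reaching path on diagonal i;
    # a cell in row r of diagonal i sits in column r + i.
    reach = {i: lce(p, t, 0, i) for i in range(m + 1)}  # the 0-paths
    for d in range(1, k + 1):
        nxt = {}
        for i in range(max(-d, -n), m + 1):
            cands = []
            if i + 1 in reach:
                cands.append(reach[i + 1] + 1)  # move down from diagonal i+1
            if i - 1 in reach:
                cands.append(reach[i - 1])      # move right from diagonal i-1
            if i in reach:
                cands.append(reach[i] + 1)      # mismatch step on diagonal i
            row = min(max(cands), n, m - i)     # stay inside the table
            nxt[i] = row + lce(p, t, row, row + i)
        reach = nxt
    return sorted(n + i for i, row in reach.items() if row == n)


print(k_difference_ends("abcdefghi", "abcdeefghi", 1))  # [10]
```

Only the farthest row per diagonal is stored for the previous value of d, which is what keeps the table small compared with the full dynamic program.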

3. Exclusion Methods
Previous methods still have running time Θ(km).
Can we get to expected times of O(m) or even smaller?
– Note, we are not asking for worst-case times this small. For example, Boyer-Moore has sublinear expected time for the exact matching problem

Partition Idea
Partition T or P into consecutive regions of a given length r.
Search/Filter phase
– Using various exact matching methods, search using these partition values to filter out possible locations of P in T
Check phase
– For each surviving location, use an approximate matching technique to verify an approximate occurrence of P

BYP Choices
Baeza-Yates and Perleberg
– O(m) expected running time for modest error rates
Let r = floor(n/(k+1)).
Partition P into consecutive length-r intervals
– The last interval may have length less than r
Key property
– There are at least k+1 intervals of P that have full length r
– If P matches a substring T’ of T with at most k differences, then T’ must contain one interval of length r that matches one of the k+1 intervals of P exactly

BYP Algorithm
– Let P’ be the set of k+1 substrings of P taken from the first k+1 regions of P’s partition
– Build a keyword tree for P’
– Using Aho-Corasick, find I, the set of all starting locations in T where any pattern in P’ occurs exactly
– For each i in I, use an approximate matching algorithm (probably based on dynamic programming) to locate the end points of all approximate occurrences of P in the substring T[i-n-k..i+n+k]
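A compact Python sketch of the filter-then-check structure follows. It swaps in Python's built-in `str.find` for the Aho-Corasick keyword tree, which keeps the example short but gives up the O(n+m) search bound; the names `approx_ends` and `byp_search` are assumptions for the illustration, and positions in the result are 1-based end positions in T.

```python
def approx_ends(p, t, k):
    """End positions j (1-based) in t of matches of p with <= k differences."""
    prev = [0] * (len(t) + 1)  # zero first row: end spaces in t are free
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cur[j] = min(prev[j - 1] + (p[i - 1] != t[j - 1]),
                         prev[j] + 1,
                         cur[j - 1] + 1)
        prev = cur
    return [j for j, d in enumerate(prev) if d <= k]


def byp_search(p, t, k):
    """Filter with exact region matches, then check each surviving window."""
    n, m = len(p), len(t)
    r = n // (k + 1)
    if r == 0:
        raise ValueError("k is too large for a pattern of this length")
    # P': the first k+1 regions of length r in P.
    regions = {p[a:a + r] for a in range(0, (k + 1) * r, r)}
    # Search phase (str.find stands in for Aho-Corasick here).
    hits = set()
    for region in regions:
        start = t.find(region)
        while start != -1:
            hits.add(start)
            start = t.find(region, start + 1)
    # Check phase: dynamic programming on T[i-n-k..i+n+k] only.
    ends = set()
    for i in hits:
        lo, hi = max(0, i - n - k), min(m, i + n + k)
        ends.update(lo + j for j in approx_ends(p, t[lo:hi], k))
    return sorted(ends)


print(byp_search("abcdefghi", "xxabcdeefghixx", 1))  # [12]
```

When the regions rarely occur in T, almost all of T is excluded before any dynamic programming is done, which is where the expected-time savings come from.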

Running Time Analysis
Search phase: O(n+m) time and O(n) space
– We could use suffix trees, or suffix trees and matching statistics, as well for similar performance
– We could use the Boyer-Moore set matching techniques described in Section 7.16 to speed this up even more
Check phase
– Dynamic programming takes O(n^2) time per location checked
– Previous results can be used for O(kn) time per location checked

Expected running time
We need the expected number of locations to be checked.
Probability model
– Each character of T is drawn uniformly at random from an alphabet of size q
An upper bound on the expected number of occurrences in T of the regions from P’ is m(k+1)/q^r
– T has roughly m substrings of length r
– Each substring matches an individual region p with probability 1/q^r

Expected running time
Expected time of checking: [m(k+1)/q^r] n^2
– (number of occurrences) x (time per occurrence)
We need to determine what values of k make this cost <= a constant times m.
Some mathematical manipulation shows that BYP is O(m) as long as k = O(n/log n)
– That is, the error rate is less than 1 every log n characters
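To get a feel for the bound, the expression can be evaluated directly. The helper below is a hypothetical illustration (the name `expected_check_cost` and the sample parameters are not from the slides); it simply plugs numbers into m(k+1)/q^r and multiplies by the n^2 cost per check.

```python
def expected_check_cost(n, m, k, q):
    """Upper bounds on the expected hits and the expected check-phase work."""
    r = n // (k + 1)                        # region length
    expected_hits = m * (k + 1) / q ** r    # m(k+1)/q^r
    return expected_hits, expected_hits * n * n


# Assumed sample values: pattern of length 20, text of length 10^6,
# one allowed difference, DNA alphabet (q = 4).
hits, work = expected_check_cost(n=20, m=10**6, k=1, q=4)
print(hits, work)
```

With these sample values the region length is 10, so fewer than two candidate locations are expected in a million characters, and the check phase is negligible next to the O(m) search phase.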

Extensions See the book, pages , for some extensions to these ideas The expected work can be made sublinear