Computational Genomics Lecture #3a Much of this class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir. Changes.

Multiple sequence alignment
Computational Genomics Lecture #3a. Changes to Nir Friedman’s lecture made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.
Background readings: Chapters 2.5, 2.7 in the textbook Biological Sequence Analysis, Durbin et al.; the relevant chapters in Introduction to Computational Molecular Biology, Setubal and Meidanis; Chapter 15 in Gusfield’s book; p. 81 in Kanehisa’s book.

Ladies and gentlemen, boys and girls: the holy grail, Multiple Sequence Alignment.

Multiple Sequence Alignment
$S_1$ = AGGTC, $S_2$ = GTTCG, $S_3$ = TGAAC

Possible alignments:

    A G G T - C -        A G G T C -
    - G - T T C G        G T T - C G
    T G - A A C -        - T G A A C

Multiple Sequence Alignment: aligning more than two sequences
Definition: Given strings $S_1, S_2, \dots, S_k$, a multiple (global) alignment maps them to strings $S'_1, S'_2, \dots, S'_k$ that may contain blanks, where:
1. $|S'_1| = |S'_2| = \dots = |S'_k|$
2. The removal of blanks from $S'_i$ leaves $S_i$.
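The two conditions of the definition can be checked mechanically. The following is a minimal sketch, assuming the blank is written as "-"; the function name `is_valid_alignment` and the optional check for all-blank columns are illustrative choices, not part of the lecture.

```python
def is_valid_alignment(originals, aligned, gap="-", forbid_blank_columns=True):
    """Check the two conditions of the definition of a multiple alignment."""
    if len(originals) != len(aligned):
        return False
    # Condition 1: all aligned strings have the same length.
    if len({len(row) for row in aligned}) != 1:
        return False
    # Condition 2: removing the blanks from S'_i leaves S_i.
    if any(row.replace(gap, "") != s for row, s in zip(aligned, originals)):
        return False
    # Optional convention (assumed here): no column consists solely of blanks.
    if forbid_blank_columns:
        return all(any(ch != gap for ch in col) for col in zip(*aligned))
    return True


seqs = ["AGGTC", "GTTCG", "TGAAC"]
print(is_valid_alignment(seqs, ["AGGT-C-", "-G-TTCG", "TG-AAC-"]))
```

The check runs in time proportional to the size of the alignment matrix, so it is cheap to use as a sanity test on the output of any alignment routine.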

Multiple alignments
We use a matrix to represent the alignment of k sequences, $K = (x_1, \dots, x_k)$. We assume that no column consists solely of blanks.

    x_1:  M Q - I L L L
    x_2:  M L R - L L -
    x_3:  M K - I L L L
    x_4:  M P P V L I L

The common scoring functions give a score to each column and set $\mathrm{score}(K) = \sum_i \mathrm{score}(\mathrm{column}(i))$. For k = 10, a scoring function has $2^k - 1 > 1000$ entries to specify. The scoring function is symmetric, so the order of its arguments does not matter: score(I, -, I, V) = score(-, I, I, V).

SUM OF PAIRS

    M Q - I L L L
    M L R - L L -
    M K - I L L L
    M P P V L I L

A common scoring function is SP, the sum of the scores of the projected pairwise alignments: $\mathrm{SPscore}(K) = \sum_{i<j} \mathrm{score}(x_i, x_j)$. In order for this score to be written as $\sum_i \mathrm{score}(\mathrm{column}(i))$, we set score(-,-) = 0. Why? Because these entries appear in the sum over columns but not in the sum over projected pairwise alignments (lines). Note that we need to specify score(-,-) because a column may have several blanks (as long as not all of its entries are blanks).

SUM OF PAIRS
Definition: The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all projected pairwise alignments induced by A, where the pairwise alignment function $\mathrm{score}(x_i, x_j)$ is additive.

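Example
Consider the following alignment:

    a c - c d b -
    - c - a d b d
    a - b c d a d

Using the edit distance, with a cost of 1 for every column in which the two symbols differ and 0 when they are equal (so a pair of blanks costs 0), the three projected pairwise alignments have values 3, 4 and 5, so this alignment has an SP value of 3 + 4 + 5 = 12.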
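The SP value of this example can be recomputed with a few lines of code. This is a minimal sketch, assuming a unit-cost edit distance (0 for equal symbols, including two blanks, and 1 otherwise); the names `unit_cost`, `sp_value` and `sp_value_by_columns` are illustrative.

```python
from itertools import combinations


def unit_cost(a, b):
    """Assumed pairwise cost: 0 if the symbols are equal (also for two blanks), else 1."""
    return 0 if a == b else 1


def sp_value(rows, cost=unit_cost):
    """Sum over all pairs of rows of the value of the projected pairwise alignment."""
    return sum(
        sum(cost(a, b) for a, b in zip(x, y))
        for x, y in combinations(rows, 2)
    )


def sp_value_by_columns(rows, cost=unit_cost):
    """Same value, computed as a sum of column scores."""
    return sum(
        sum(cost(a, b) for a, b in combinations(column, 2))
        for column in zip(*rows)
    )


alignment = ["ac-cdb-", "-c-adbd", "a-bcdad"]
print(sp_value(alignment), sp_value_by_columns(alignment))  # prints "12 12"
```

The two functions agree because the cost of a pair of blanks is 0, which is exactly why score(-,-) = 0 is required for the column-wise form.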

Multiple Sequence Alignment
Given k strings of length n, there is a natural generalization of the dynamic programming algorithm that finds an alignment that maximizes SP-score(K) = $\sum_{i<j} \mathrm{score}(x_i, x_j)$. Instead of a 2-dimensional table, we now have a k-dimensional table to fill. For each index vector $i = (i_1, \dots, i_k)$, compute an optimal multiple alignment of the k prefixes $x_1(1, \dots, i_1), \dots, x_k(1, \dots, i_k)$. The adjacent entries are those whose index differs from i by zero or one in each coordinate. Each entry depends on $2^k - 1$ adjacent entries.

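The idea via k = 2
The cells involved are V[i, j], V[i+1, j], V[i, j+1] and V[i+1, j+1]. Note that the new cell index (i+1, j+1) differs from the previous indices by one of the $2^k - 1 = 3$ non-zero binary vectors (1,1), (1,0), (0,1). Recall the notation: V[i, j] is the score of an optimal alignment of the prefixes $x_1(1, \dots, i)$ and $x_2(1, \dots, j)$, and the following recurrence for V:
$V[i+1, j+1] = \max\{\, V[i, j] + \mathrm{score}(x_1(i+1), x_2(j+1)),\; V[i, j+1] + \mathrm{score}(x_1(i+1), -),\; V[i+1, j] + \mathrm{score}(-, x_2(j+1)) \,\}$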

Multiple Sequence Alignment
Given k strings of length n, there is a generalization of the dynamic programming algorithm that finds an optimal SP alignment.
Computational cost: Instead of a 2-dimensional table we now have a k-dimensional table to fill. Each dimension’s size is n+1. Each entry depends on $2^k - 1$ adjacent entries. Number of evaluations of the scoring function: $O(2^k n^k)$.
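The k-dimensional table can be sketched directly with index tuples. The code below is a minimal illustration and makes two assumptions that the slides leave open: it minimizes an SP distance under a unit cost (rather than maximizing a similarity score), and it returns only the optimal value, not the alignment. The name `optimal_sp_cost` is illustrative. It is exponential in k and only usable for a few short strings.

```python
from itertools import combinations, product


def unit_cost(a, b):
    """Assumed pairwise cost: 0 if equal (also for two blanks), else 1."""
    return 0 if a == b else 1


def optimal_sp_cost(seqs, cost=unit_cost, gap="-"):
    """Minimum SP value over all multiple alignments of seqs (k-dimensional DP)."""
    k = len(seqs)
    # The 2^k - 1 non-zero binary vectors: which sequences advance in the last column.
    moves = [d for d in product((0, 1), repeat=k) if any(d)]
    table = {(0,) * k: 0}
    # product() enumerates cells in lexicographic order, so every predecessor
    # (componentwise smaller or equal) has already been filled in.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if not any(cell):
            continue
        best = None
        for d in moves:
            prev = tuple(c - step for c, step in zip(cell, d))
            if min(prev) < 0:
                continue
            # Last column: the next character where the sequence advances, else a blank.
            column = [s[c - 1] if step else gap for s, c, step in zip(seqs, cell, d)]
            column_cost = sum(cost(a, b) for a, b in combinations(column, 2))
            candidate = table[prev] + column_cost
            if best is None or candidate < best:
                best = candidate
        table[cell] = best
    return table[tuple(len(s) for s in seqs)]


print(optimal_sp_cost(["kitten", "sitting"]))  # k = 2 reduces to edit distance: 3
print(optimal_sp_cost(["accdb", "cadbd", "abcdad"]))
```

For k = 2 the three moves are exactly the vectors (1,1), (1,0), (0,1) of the pairwise recurrence. Each cell tries $2^k - 1$ moves and each move scores a column in $O(k^2)$ pair evaluations, which is where the exponential cost comes from.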

Complexity of the DP approach
Number of cells: about $n^k$. Number of adjacent cells: $O(2^k)$. Computing the SP score of each column(i, b) takes $O(k^2)$. The total running time is $O(k^2 2^k n^k)$, which is totally unacceptable! Maybe one can do better?

But MSA is Intractable
There is not much hope for a polynomial-time algorithm, because the problem has been shown to be NP-complete. The proof is quite tricky and quite recent. Some previous proofs were bogus; Isaac Elias provided an apparently correct proof. We need heuristics or approximation algorithms to reduce the running time.

Tree Alignments
Assume that there is a tree T = (V, E) whose leaves are the input sequences. We want to associate a sequence with each internal node. Tree-score(K) = $\sum_{(i,j) \in E} \mathrm{score}(x_i, x_j)$. Finding the optimal assignment of sequences to the internal nodes is NP-hard. We will meet this problem again in the study of phylogenetic trees (it is related to the parsimony problem).

Multiple Sequence Alignment Heuristics
Example: 4 sequences A, B, C, D.
A. Perform all 6 pairwise alignments. Find scores. Build a “similarity tree” (ordered from similar to distant).
B. Multiple alignment following the tree from A:
   Align the most similar pair (B, D), allowing gaps to optimize the alignment.
   Align the next most similar pair (A, C).
   Now “align the alignments”, introducing gaps if necessary to optimize the alignment of (BD) with (AC).
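Step A can be sketched as follows. The slide does not say how the “similarity tree” is built, so this sketch assumes pairwise edit distances and single-linkage clustering purely for illustration; the sequences and the names `edit_distance` and `guide_tree` are hypothetical, and real tools may build the tree differently.

```python
from itertools import combinations


def edit_distance(s, t):
    """Unit-cost edit distance between two strings (row-by-row DP)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def guide_tree(seqs):
    """Merge the closest clusters repeatedly; returns the tree as nested tuples of names.

    seqs maps a name to a sequence. Single linkage is an assumption of this sketch.
    """
    dist = {
        frozenset((x, y)): edit_distance(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)
    }
    clusters = [(name, [name]) for name in seqs]  # (subtree, member names)

    def linkage(pair):
        (_, members_a), (_, members_b) = pair
        return min(dist[frozenset((x, y))] for x in members_a for y in members_b)

    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c is not left and c is not right]
        clusters.append(((left[0], right[0]), left[1] + right[1]))
    return clusters[0][0]


# Hypothetical sequences in which A is close to C and B is close to D.
seqs = {"A": "ACGTACGT", "B": "TTGCAAGC", "C": "ACGTACGA", "D": "TTGCAAGG"}
print(guide_tree(seqs))
```

The nested tuples give the order of step B: each innermost pair is aligned first, and each enclosing tuple corresponds to one “align the alignments” step.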

The tree-based progressive method for multiple sequence alignment, used in practice (Clustal)
(a) A tree (dendrogram) obtained by “cluster analysis”, with leaves DEHUG3, DEPGG3, DEBYG3, DEZYG3, DEBSGF.
(b) Pairwise alignment of sequences’ alignments:

    L W R D G R G A L Q
    L W R G G R G A A Q
    D W R - G R T A S G
    L R R - A R T A S A
    L - R G A R A A A E

(modified from Speed’s ppt presentation, see p. 81 in Kanehisa’s book)

Visualization of Alignment

Multiple Sequence Alignment – Approximation Algorithm
Now we will see an $O(k^2 n^2)$ multiple alignment algorithm for the SP-score that approximates the optimal solution’s score within a factor of at most $2(1 - 1/k) < 2$.

Star Alignments
Rather than summing up all pairwise alignments, select a fixed sequence $S_1$ as a center, and set Star-score(K) = $\sum_{j>1} \mathrm{score}(S_1, S_j)$. The algorithm to find an optimal alignment: at each step, add another sequence aligned with $S_1$, keeping old gaps and possibly adding new ones (i.e. keeping the old alignment intact).

Multiple Sequence Alignment – Approximation Algorithm
Polynomial time algorithm.
Assumption: the function δ is a distance function; in particular, $\delta(x, y) \le \delta(x, z) + \delta(z, y)$ for all x, y, z (triangle inequality).
Let D(S, T) be the value of the minimum-cost global alignment between S and T.

Multiple Sequence Alignment – Approximation Algorithm (cont.)
Polynomial time algorithm: The input is a set Γ of k strings $S_i$.
1. Find the “center string” $S_1$ that minimizes $\sum_{S \in \Gamma} D(S_1, S)$.
2. Call the remaining strings $S_2, \dots, S_k$.
3. Add the strings one at a time to a multiple alignment that initially contains only $S_1$, as follows. Suppose $S_1, \dots, S_{i-1}$ are already aligned as $S'_1, \dots, S'_{i-1}$. Add $S_i$ by running the dynamic programming algorithm on $S'_1$ and $S_i$ to produce $S''_1$ and $S'_i$. Adjust $S'_2, \dots, S'_{i-1}$ by adding gaps in those columns where gaps were added to get $S''_1$ from $S'_1$. Replace $S'_1$ by $S''_1$.
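The three steps can be sketched in code as follows. This is a minimal illustration, assuming a unit-cost δ with δ(-,-) = 0 and non-empty input strings; the names `delta`, `align_pair` and `star_align` are illustrative and not from the lecture.

```python
from itertools import combinations

GAP = "-"


def delta(a, b):
    """Assumed distance: 0 for equal symbols (including two blanks), else 1."""
    return 0 if a == b else 1


def align_pair(s, t):
    """Optimal global alignment of s (which may already contain gaps) with t.

    Returns (cost, aligned_s, aligned_t, inserted), where inserted[c] is True
    for the columns in which a new gap was added to s.
    """
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + delta(s[i - 1], GAP)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + delta(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                          D[i - 1][j] + delta(s[i - 1], GAP),
                          D[i][j - 1] + delta(GAP, t[j - 1]))
    a, b, inserted = [], [], []
    i, j = n, m
    while i > 0 or j > 0:  # traceback from the last cell
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + delta(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); inserted.append(False)
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + delta(s[i - 1], GAP):
            a.append(s[i - 1]); b.append(GAP); inserted.append(False)
            i -= 1
        else:
            a.append(GAP); b.append(t[j - 1]); inserted.append(True)
            j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b)), inserted[::-1]


def star_align(seqs):
    """Return (index of the center string, aligned rows in input order)."""
    k = len(seqs)
    dist = {p: align_pair(seqs[p[0]], seqs[p[1]])[0] for p in combinations(range(k), 2)}
    # Step 1: the center minimizes the sum of distances to all other strings.
    center = min(range(k), key=lambda c: sum(
        dist[min(c, j), max(c, j)] for j in range(k) if j != c))
    order = [center] + [i for i in range(k) if i != center]
    rows = [seqs[center]]
    # Step 3: add the strings one by one, aligning each with the current center row.
    for i in order[1:]:
        _, new_center, new_row, inserted = align_pair(rows[0], seqs[i])
        adjusted = []
        for old in rows[1:]:
            chars = iter(old)
            # Old rows receive a gap wherever a new gap was added to the center row.
            adjusted.append("".join(GAP if ins else next(chars) for ins in inserted))
        rows = [new_center] + adjusted + [new_row]
    result = [None] * k
    for idx, row in zip(order, rows):
        result[idx] = row
    return center, result


center, rows = star_align(["AGGTC", "GTTCG", "TGAAC"])
print(center)
print("\n".join(rows))
```

Because δ(-,-) = 0, aligning a new string against the gapped center row costs the same as aligning it against the original center string, so each pair (center, $S_i$) stays optimally aligned in the final result.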

Multiple Sequence Alignment – Approximation Algorithm (cont.)
Time analysis: Choosing $S_1$ requires running the dynamic programming algorithm $\binom{k}{2}$ times, for $O(k^2 n^2)$ in total. When $S_i$ is added to the multiple alignment, the length of $S'_1$ is at most $i \cdot n$, so the time to add all k strings is $\sum_{i=1}^{k-1} O((i \cdot n) \cdot n) = O(k^2 n^2)$.

Multiple Sequence Alignment – Approximation Algorithm (cont.)
Performance analysis:
M: the alignment produced by this algorithm.
d(i, j): the distance M induces on the pair $S_i$, $S_j$.
M*: an optimal alignment.
For all i, $d(1, i) = D(S_1, S_i)$, because we performed an optimal alignment between $S'_1$ and $S_i$, and δ(-,-) = 0.

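Multiple Sequence Alignment – Approximation Algorithm (cont.)
Performance analysis: the SP value of M is bounded from above using the triangle inequality, and the SP value of M* is bounded from below using the definition of $S_1$.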
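A sketch of how the two labelled steps combine into the factor $2(1 - 1/k)$. The notation here is assumed for illustration: $v(\cdot)$ sums the induced distances over ordered pairs (twice the SP value, which does not affect the ratio), and $d^*(i,j)$ is the distance that M* induces on $S_i$, $S_j$.

```latex
\begin{align*}
v(M) &= \sum_{i=1}^{k} \sum_{j \ne i} d(i,j)
      \;\le\; \sum_{i=1}^{k} \sum_{j \ne i} \bigl[ d(i,1) + d(1,j) \bigr]
      && \text{(triangle inequality)} \\
     &= 2(k-1) \sum_{l=2}^{k} d(1,l)
      \;=\; 2(k-1) \sum_{l=2}^{k} D(S_1,S_l)
      && \text{(since } d(1,l) = D(S_1,S_l)\text{)} \\[4pt]
v(M^*) &= \sum_{i=1}^{k} \sum_{j \ne i} d^*(i,j)
      \;\ge\; \sum_{i=1}^{k} \sum_{j \ne i} D(S_i,S_j)
      && \text{(} D \text{ is the pairwise optimum)} \\
     &\ge\; k \sum_{l=2}^{k} D(S_1,S_l)
      && \text{(definition of } S_1\text{)} \\[4pt]
\frac{v(M)}{v(M^*)} &\le \frac{2(k-1)}{k} \;=\; 2\left(1 - \frac{1}{k}\right) \;<\; 2 .
\end{align*}
```

The triangle inequality applies to the induced distances because it holds column by column for δ, including columns that contain blanks.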

Multiple Sequence Alignment – Approximation Algorithm
The algorithm relies heavily on the scoring function being a distance. It produces an alignment whose SP score is at most twice the minimum. What if the scoring function were a similarity? Can we get an efficient algorithm whose score is at least half the maximum? A third of the maximum? … We dunno!