Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.

Presentation transcript:

Multiple Sequence Alignment Tutorial #4 © Ilan Gronau

2 Multiple Sequence Alignment Reminder
S_1 = AGGTC
S_2 = GTTCG
S_3 = TGAAC

Possible alignment:
A G G T - C -
- G - T T C G
T G - A A C -

Possible alignment:
A G G T - C -
G T T - - C G
- T G A A A C

3 Multiple Sequence Alignment Reminder
Input: Sequences S_1, S_2, …, S_k over the same alphabet.
Output: Gapped sequences S'_1, S'_2, …, S'_k of equal length, such that:
1. |S'_1| = |S'_2| = … = |S'_k|
2. Removing the spaces from S'_i yields S_i
The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.
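
The SP score can be illustrated with a minimal Python sketch. The column scoring used here (0 for identical symbols, including two spaces; 1 for a mismatch or a letter against a space) is an assumption chosen for illustration, not the scoring scheme of the tutorial, and the names `pair_score` and `sp_score` are hypothetical.

```python
from itertools import combinations

def pair_score(a, b):
    """Assumed unit-cost column score: 0 if the symbols are equal
    (this includes a space against a space), 1 otherwise."""
    return 0 if a == b else 1

def sp_score(alignment):
    """Sum-of-pairs score: sum the induced pairwise scores over all row pairs."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have equal length")
    return sum(
        pair_score(x[c], y[c])
        for x, y in combinations(alignment, 2)
        for c in range(len(x))
    )

# One alignment of AGGTC, GTTCG and TGAAC.
alignment = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]
print(sp_score(alignment))  # 12 under the assumed unit costs
```

Each of the three induced pairwise alignments contributes 4 here, so the total is 12. With a similarity-based scoring function the same summation applies, only `pair_score` changes.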

4 Multiple Sequence Alignment Approximation Algorithm
The 'star' algorithm:
Input: Γ, a set of k strings S_1, …, S_k.
0. For each i < j, calculate D(S_i, S_j).
1. Find the string S' (the center) that minimizes Σ_{S ∈ Γ} D(S', S).
2. Denote S_1 = S' and the rest of the strings as S_2, …, S_k.
3. Iteratively add S_2, …, S_k to the alignment as follows:
   a. Suppose S_1, …, S_{i-1} are already aligned as S'_1, …, S'_{i-1}.
   b. Align S_i to S'_1 to produce S'_i and S''_1.
   c. Adjust S'_2, …, S'_{i-1} by adding spaces wherever spaces were added to S''_1.
   d. Replace S'_1 by S''_1.
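
The steps of the star algorithm can be sketched in Python as follows. This is a minimal sketch under stated assumptions: D is taken to be unit-cost edit distance, a space already present in the center is treated as an ordinary symbol (so a space against a space costs 0), and the names `cost`, `align` and `star_alignment` are hypothetical.

```python
def cost(x, y):
    """Assumed unit cost: 0 for equal symbols (including two spaces), else 1."""
    return 0 if x == y else 1

def align(s, t):
    """Global alignment of s (may already contain '-') and t.
    Returns (distance, columns); each column is (index in s or None,
    index in t or None), where None marks a newly inserted space."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + cost(s[i - 1], '-')
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + cost('-', t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + cost(s[i - 1], t[j - 1]),
                          D[i - 1][j] + cost(s[i - 1], '-'),
                          D[i][j - 1] + cost('-', t[j - 1]))
    cols, i, j = [], n, m
    while i > 0 or j > 0:  # trace back one optimal path
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + cost(s[i - 1], t[j - 1]):
            cols.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + cost(s[i - 1], '-'):
            cols.append((i - 1, None)); i -= 1
        else:
            cols.append((None, j - 1)); j -= 1
    cols.reverse()
    return D[n][m], cols

def star_alignment(seqs):
    """Returns (index of the center, aligned rows with the center first)."""
    dist = [[align(a, b)[0] for b in seqs] for a in seqs]             # step 0
    center = min(range(len(seqs)), key=lambda i: sum(dist[i]))        # step 1
    rows = [seqs[center]]                                             # step 2
    for i, s in enumerate(seqs):                                      # step 3
        if i == center:
            continue
        _, cols = align(rows[0], s)                                   # step 3b
        # steps 3c and 3d: new spaces in the center are copied into every old row
        rows = [''.join('-' if p is None else row[p] for p, _ in cols) for row in rows]
        rows.append(''.join('-' if q is None else s[q] for _, q in cols))
    return center, rows

center, rows = star_alignment(["AGGTC", "GTTCG", "TGAAC"])
print("\n".join(rows))
```

Under these assumptions the alignment induced between the center and every other row has cost exactly D(S_1, S_i), which is the property the approximation guarantee of the star algorithm relies on.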

5 Multiple Sequence Alignment Reminder
Problem: Conventional MA does not correctly model evolutionary relationships.
[Figure panels: optimal sum-of-pairs alignment, star algorithm alignment, tree-based alignment]

6 Tree Alignment
Input:
   X: a set of sequences
   T: a phylogenetic tree on X (leaves labeled by X)
Output: labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal.
How do we label internal vertices?
   Sequences
   Profiles (multiple alignments)

7 Profile Alignment
A profile of an MA of length n over alphabet Σ is a (|Σ|+1) × n table. Column i holds the distribution of the symbols of Σ (and the gap symbol) at position i.
Example alignment:
A G G T - C -
- G - T T C G
T G - A A C -
[Table: profile of the example alignment, with rows A, T, G, C]
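
A profile can be computed directly from the rows of an alignment. The sketch below is illustrative only: the function name `profile`, the dictionary layout, and the use of exact fractions are choices made here, not part of the tutorial.

```python
from collections import Counter
from fractions import Fraction

def profile(alignment, alphabet="ATGC"):
    """Return {symbol: [frequency in column 0, column 1, ...]} for every
    symbol of the alphabet plus the gap symbol '-'."""
    symbols = alphabet + "-"
    k = len(alignment)
    table = {x: [] for x in symbols}
    for column in zip(*alignment):          # iterate over alignment columns
        counts = Counter(column)
        for x in symbols:
            table[x].append(Fraction(counts[x], k))
    return table

table = profile(["AGGT-C-", "-G-TTCG", "TG-AAC-"])
for symbol, row in table.items():
    print(symbol, " ".join(str(f) for f in row))
```

For this alignment the first column holds A, T and a gap with frequency 1/3 each, and the second column holds G with frequency 1.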

8 Profile Alignment
Aligning a sequence to a profile:
   Matching a letter to a position: weighted average of scores.
   Indels: introducing new columns gets special consideration.
   (The same goes for aligning two profiles.)
Solve using standard DP algorithms for pairwise alignment.
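
The weighted-average idea can be sketched as a small dynamic program. The assumptions are deliberately simple and are not taken from the tutorial: unit costs, a profile given as a list of columns (each a dictionary from symbol to frequency), and a new column charged as a letter against a gap. The tutorial notes that new columns get special consideration, which this sketch does not attempt to model. All names are hypothetical.

```python
from fractions import Fraction

def unit_cost(x, y):
    """Assumed unit cost: 0 for equal symbols, 1 otherwise."""
    return 0 if x == y else 1

def column_cost(symbol, column, cost=unit_cost):
    """Weighted average of the costs of `symbol` against a profile column."""
    return sum(freq * cost(symbol, x) for x, freq in column.items())

def seq_to_profile_cost(seq, columns, cost=unit_cost):
    """Cost of an optimal global alignment of `seq` to the profile columns."""
    n, m = len(seq), len(columns)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + cost(seq[i - 1], '-')   # new column (assumed cost)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + column_cost('-', columns[j - 1], cost)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + column_cost(seq[i - 1], columns[j - 1], cost),
                D[i - 1][j] + cost(seq[i - 1], '-'),
                D[i][j - 1] + column_cost('-', columns[j - 1], cost),
            )
    return D[n][m]

third = Fraction(1, 3)
print(column_cost('A', {'A': third, 'T': third, '-': third}))  # 2/3
```

Matching A against a column that is one third A, one third T and one third gap costs 0·1/3 + 1·1/3 + 1·1/3 = 2/3 under the assumed unit costs.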

9 Clustal Algorithm
Progressive MA using a phylogenetic tree:
   At each point, hold profiles for all leaves.
   Choose neighboring leaves (neighbors have a common father in T).
   Align the two profiles to get the 'father-profile'.
   The new profile replaces the two old ones in the set of leaf-profiles.
How do we obtain the phylogenetic tree?
   From pairwise distances between sequences.
   Algorithms such as UPGMA, Neighbor-Joining, etc.
   We discuss such algorithms later in the course.
ClustalW: a more advanced version, in which sequences/profiles are weighted.
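
The progressive scheme can be sketched in Python under simplifying assumptions. This is not the Clustal implementation: the guide tree is assumed to be given as nested pairs, a profile is represented by the aligned rows themselves, two columns are scored by their average unit cost (equivalent to a weighted average over the column distributions), and no sequence weighting is applied. The names `col_cost`, `align_profiles` and `progressive` are hypothetical.

```python
def col_cost(c1, c2):
    """Average unit cost between the symbols of two alignment columns."""
    return sum(0 if a == b else 1 for a in c1 for b in c2) / (len(c1) * len(c2))

def align_profiles(A, B):
    """Align two alignments (lists of equal-length rows); return merged rows."""
    ca, cb = list(zip(*A)), list(zip(*B))
    ga, gb = ('-',) * len(A), ('-',) * len(B)       # all-gap columns
    n, m = len(ca), len(cb)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + col_cost(ca[i - 1], gb)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + col_cost(ga, cb[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + col_cost(ca[i - 1], cb[j - 1]),
                          D[i - 1][j] + col_cost(ca[i - 1], gb),
                          D[i][j - 1] + col_cost(ga, cb[j - 1]))
    cols, i, j = [], n, m
    while i > 0 or j > 0:  # trace back, emitting merged columns
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + col_cost(ca[i - 1], cb[j - 1]):
            cols.append(ca[i - 1] + cb[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + col_cost(ca[i - 1], gb):
            cols.append(ca[i - 1] + gb); i -= 1
        else:
            cols.append(ga + cb[j - 1]); j -= 1
    cols.reverse()
    return [''.join(row) for row in zip(*cols)]

def progressive(tree):
    """tree: a sequence string (leaf) or a pair (left, right) of subtrees.
    Neighboring leaves are merged first, then their 'father-profile' moves up."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return align_profiles(progressive(left), progressive(right))

print("\n".join(progressive((("AGGTC", "GTTCG"), "TGAAC"))))
```

The recursion mirrors the slide: two neighbors with a common father are aligned, and the result replaces them until a single alignment remains at the root.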

10 Lifted Tree Alignments
Lifted tree alignment: each internal node is labeled by one of the labels of its daughters. Internal nodes are sequences and not profiles.
Example: [Figure: tree with leaves S_1, S_2, S_3, S_4, S_6, S_5 and internal nodes labeled S_2, S_4, S_4, S_5]
We'll show:
1. A DP algorithm for optimal lifted tree alignment
2. The optimal lifted alignment is a 2-approximation of the optimal tree alignment

11 Lifted Tree Alignments Algorithm
Input:
   X: a set of sequences
   T: a phylogenetic tree on X (leaves labeled by X)
Output: lifted labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal.
Basic principle: for every node v in T and every sequence S in X, calculate d(v,S), the optimal cost of v's subtree when v is labeled by S.
The cost of the optimal tree is min_{S ∈ X} d(r,S), where r is the root of T.

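12 Lifted Tree Alignments Algorithm
d(v,S): the optimal cost of v's subtree when v is labeled by S.
Initialization: for a leaf v labeled S_v: d(v,S_v) = 0.
Recurrence: for an internal node v with daughters u_1, …, u_l: d(v,S) = Σ_{i=1..l} min_{S'} [ D(S,S') + d(u_i,S') ], where S' ranges over the leaf labels in the subtree of u_i, and S' = S for the daughter whose subtree contains S.
Correctness: check the optimal substructure property (an optimal solution is composed of optimal solutions to its subproblems).
Complexity:
   O(k^2) pairwise alignments: O(n^2 k^2).
   k-1 iterations. For an internal node v: O(k_v^2) work, where k_v is the number of leaves in the subtree of v. Over all internal nodes: O(k^2 depth(T)) = O(k^3).
   Total: O(k^2 (n^2 + depth(T))).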
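
A minimal Python sketch of the dynamic program for d(v,S) follows. It rests on several assumptions that are not from the tutorial: D is unit-cost edit distance, the tree is given as nested tuples with sequence strings at the leaves, all leaf sequences are distinct, and a node labeled S must have the daughter whose subtree contains S also labeled S (which is what a lifted labeling requires). The names `edit_distance`, `lifted_costs` and `optimal_lifted_cost` are hypothetical.

```python
def edit_distance(s, t):
    """Unit-cost edit distance, used here as the pairwise distance D."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def lifted_costs(tree, D=edit_distance):
    """Return {S: d(v, S)} for the root v of `tree`, for every leaf sequence S
    below v. A leaf is a string; an internal node is a tuple of subtrees."""
    if isinstance(tree, str):
        return {tree: 0}                      # initialization: d(v, S_v) = 0
    tables = [lifted_costs(child, D) for child in tree]
    result = {}
    for j, own in enumerate(tables):
        for S in own:
            total = own[S]                    # daughter u_j keeps label S: edge cost 0
            for i, other in enumerate(tables):
                if i != j:                    # other daughters pick their best label
                    total += min(D(S, T) + c for T, c in other.items())
            result[S] = total
    return result

def optimal_lifted_cost(tree, D=edit_distance):
    """Cost of an optimal lifted tree alignment: the minimum of d(root, S)."""
    return min(lifted_costs(tree, D).values())

print(optimal_lifted_cost((("AGGTC", "GTTCG"), "TGAAC")))
```

For clarity the sketch recomputes D on demand; precomputing all O(k^2) pairwise distances once, as in the complexity analysis, avoids the repeated alignments.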

13 Lifted Tree Alignments Approximation analysis
Claim: The optimal LTA 2-approximates general tree alignment.
We'll show the construction of an LTA that costs at most twice the optimal TA with sequence-labeled nodes (can this be generalized to profile-labeled nodes?).
Notation:
   T*: the optimal TA labels
   S_v*: the label of node v in T*
   T^L: our constructed LTA
   S_v^L (or simply S_v): the label of node v in T^L

14 Lifted Tree Alignments Approximation analysis
Construction: We label the nodes bottom-up. For a node v with daughters u_1, …, u_l, we choose the label (from S_{u_1}^L, …, S_{u_l}^L) closest to S_v*.
We need to show: D(T^L) ≤ 2D(T*)

15 Lifted Tree Alignments Approximation analysis
Analysis:
Some edges in T^L have cost 0.
Consider an edge (v,u) of cost > 0 (v is the parent of u). Let P(v,u) be the path in T* from v to the leaf labeled by S_u. Then:
   D(S_v,S_u) ≤ D(S_v,S_v*) + D(S_u,S_v*)   (triangle inequality)
              ≤ 2D(S_u,S_v*)                 (choice of S_v)
              ≤ 2D(P(v,u))                   (triangle inequality)
Hence D(S_v,S_u) ≤ 2D(P(v,u)).
If (v,u) and (v',u') are two different edges with cost > 0 in T^L, then P(v,u) and P(v',u') are edge-disjoint.
Summing over all edges of cost > 0, each edge of T* is counted at most once, so D(T^L) ≤ 2D(T*). Q.E.D.
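
The bottom-up construction and the resulting bound can be checked on a small example. The sketch below makes assumptions that are not from the tutorial: D is unit-cost edit distance (a metric), an internal node is written as a pair (label in T*, list of children), leaf sequences are distinct, and the internal labels used in the example are arbitrary stand-ins for T*, since the argument only needs the triangle inequality. The names `edit_distance` and `lift` are hypothetical.

```python
def edit_distance(s, t):
    """Unit-cost edit distance, used here as the metric D."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def lift(node, D=edit_distance):
    """Return (lifted label, cost of lifted subtree, T* label, cost of T* subtree).
    A leaf is a string; an internal node is (star_label, [children])."""
    if isinstance(node, str):
        return node, 0, node, 0
    star, children = node
    results = [lift(child, D) for child in children]
    # choose, among the daughters' lifted labels, the one closest to S_v*
    label = min((r[0] for r in results), key=lambda s: D(s, star))
    lifted_cost = sum(r[1] + D(label, r[0]) for r in results)
    star_cost = sum(r[3] + D(star, r[2]) for r in results)
    return label, lifted_cost, star, star_cost

# Internal labels below are made up for illustration.
tree = ("GGAAC", [("AGTTC", ["AGGTC", "GTTCG"]), "TGAAC"])
label, lifted_cost, _, star_cost = lift(tree)
print(label, lifted_cost, star_cost)
assert lifted_cost <= 2 * star_cost
```

The final assertion is the inequality D(T^L) ≤ 2D(T*) evaluated on this one tree; it is a sanity check of the construction, not a proof.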

16 Lifted Tree Alignments Approximation analysis
Final remarks:
   The lifted tree alignment T^L is only conceptual (we don't have T*).
   The optimal LTA cannot cost more than T^L.
   In the case of profile-labeled nodes, the construction and analysis still hold as long as the cost is a distance function.