Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.


1 Multiple Sequence Alignment Tutorial #4 © Ilan Gronau

2 Multiple Sequence Alignment Reminder
S_1 = AGGTC
S_2 = GTTCG
S_3 = TGAAC
Possible alignments, each written column by column from left to right:
A-T GGG G-- TTA -TA CCC -G-
AG- GTT GTG T-A --A CCA -GC

3 Multiple Sequence Alignment Reminder
Input: sequences S_1, S_2, …, S_k over the same alphabet.
Output: gapped sequences S'_1, S'_2, …, S'_k of equal length, such that:
1. |S'_1| = |S'_2| = … = |S'_k|
2. Removing the spaces from S'_i yields S_i.
The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.
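The SP score can be illustrated with a minimal Python sketch. The unit costs below (0 for a match, 1 for a mismatch or a space) are an assumption made for illustration; the slides do not fix a scoring scheme. The function names `pair_score` and `sp_score` are likewise hypothetical.

```python
from itertools import combinations

def pair_score(a, b, match=0, mismatch=1, gap=1):
    """Cost of the pairwise alignment induced by two gapped rows.

    Assumed unit-cost scheme (lower is better); not taken from the slides.
    """
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column with two spaces drops out of the induced alignment
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total

def sp_score(rows):
    """Sum the induced pairwise costs over all pairs of rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all gapped rows must have equal length")
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))

# Rows read off the first example alignment of AGGTC, GTTCG, TGAAC.
rows = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]
total = sp_score(rows)
```

Under these assumed costs each of the three induced pairwise alignments costs 4, so `total` is 12.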

4 Multiple Sequence Alignment Approximation Algorithm
The 'star' algorithm:
Input: Γ, a set of k strings S_1, …, S_k.
0. For each i<j, calculate D(S_i, S_j).
1. Find the string S' (the center) that minimizes the sum of D(S', S) over all S in Γ.
2. Denote S_1 = S' and the rest of the strings as S_2, …, S_k.
3. Iteratively add S_2, …, S_k to the alignment as follows:
   a. Suppose S_1, …, S_{i-1} are already aligned as S'_1, …, S'_{i-1}.
   b. Align S_i to S'_1 to produce S'_i and S''_1 aligned.
   c. Adjust S'_2, …, S'_{i-1} by adding spaces where spaces were added to S''_1.
   d. Replace S'_1 by S''_1.
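Steps 0 and 1 of the star algorithm can be sketched in Python as follows. The sketch assumes that D is the unit-cost edit distance, which is one possible choice and is not stated in the slides; the iterative merging of step 3 is not covered. `edit_distance` and `center_string` are hypothetical names.

```python
def edit_distance(s, t):
    """Unit-cost edit distance by the standard row-by-row DP (assumed choice of D)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]

def center_string(strings, dist=edit_distance):
    """Return the string minimizing the sum of distances to all the others."""
    return min(strings, key=lambda c: sum(dist(c, s) for s in strings))

seqs = ["AGGTC", "GTTCG", "TGAAC"]
center = center_string(seqs)
```

For the three example sequences the pairwise distances are 3, 3 and 4, so AGGTC (distance sum 6) is chosen as the center.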

5 Multiple Sequence Alignment Reminder
Problem: conventional MA does not correctly model evolutionary relationships.
[Figure: three panels labeled "Optimal sum-of-pairs alignment", "Star algorithm alignment" and "Tree-based alignment"]

6 Tree Alignment
Input: X, a set of sequences; T, a phylogenetic tree on X (leaves labeled by X).
Output: labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal.
How do we label internal vertices? With sequences, or with profiles (multiple alignments).

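7 Profile Alignment
A profile of an MA of length n over alphabet Σ is a (|Σ|+1) × n table. Column i holds the distribution of Σ (and the gap symbol) in position i.
Example alignment, written column by column: A-T GGG G-- TTA -TA CCC -G-
Its profile (entries are counts; dividing by 3 gives the distribution):
A  1 0 0 1 1 0 0
T  1 0 0 2 1 0 0
G  0 3 1 0 0 0 1
C  0 0 0 0 0 3 0
-  1 0 2 0 1 0 2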
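A short Python sketch of how such a profile could be computed from the rows of an alignment. The function names are hypothetical, and the alphabet order A, T, G, C simply follows the table in the slide.

```python
def profile_counts(rows, alphabet="ATGC"):
    """Count each symbol (plus the gap '-') in every column of an alignment."""
    symbols = alphabet + "-"
    columns = list(zip(*rows))  # transpose the rows into columns
    return {sym: [col.count(sym) for col in columns] for sym in symbols}

def profile_frequencies(rows, alphabet="ATGC"):
    """Turn the counts into per-column distributions by dividing by the row count."""
    k = len(rows)
    return {sym: [c / k for c in counts]
            for sym, counts in profile_counts(rows, alphabet).items()}

# Rows of the example alignment of AGGTC, GTTCG, TGAAC.
rows = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]
counts = profile_counts(rows)
freqs = profile_frequencies(rows)
```

Here `counts["T"]` is `[1, 0, 0, 2, 1, 0, 0]`, matching the T row of the table.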

8 Profile Alignment
Aligning a sequence to a profile:
Matching a letter to a position: weighted average of scores.
Indels: introducing new columns gets special consideration (the same goes for aligning two profiles).
Solve using standard DP algorithms for pairwise alignment.
Example profile (counts, to be divided by 3):
A  1 0 0 1 1 0 0
T  1 0 0 2 1 0 0
G  0 3 1 0 0 0 1
C  0 0 0 0 0 3 0
-  1 0 2 0 1 0 2
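The "weighted average of scores" for matching a letter to a profile column can be sketched as below. The unit cost (0 for equal symbols, 1 otherwise, with the gap treated like any other symbol) is an assumption for illustration, and `letter_vs_column` is a hypothetical name; handling of newly introduced columns is not covered.

```python
def unit_cost(a, b):
    """Assumed pairwise cost: 0 for equal symbols, 1 otherwise."""
    return 0 if a == b else 1

def letter_vs_column(letter, column, cost=unit_cost):
    """Average the cost of `letter` against each symbol, weighted by its frequency."""
    return sum(freq * cost(letter, sym) for sym, freq in column.items())

# Column 4 of the example profile: T appears twice and A once among 3 rows.
column4 = {"A": 1 / 3, "T": 2 / 3, "G": 0.0, "C": 0.0, "-": 0.0}
cost_T = letter_vs_column("T", column4)  # only the A fraction disagrees with T
cost_G = letter_vs_column("G", column4)  # every symbol present disagrees with G
```

Under these assumptions matching T to the column costs 1/3 and matching G costs 1. A value like this would take the place of the single-letter score inside the usual pairwise DP.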

9 Clustal Algorithm
Progressive MA using a phylogenetic tree:
At each point, hold profiles for all leaves.
Choose two neighboring leaves (neighbors have a common father in T).
Align the two profiles to get the 'father-profile'.
The new profile replaces the two old ones in the set of leaf-profiles.
How do we obtain the phylogenetic tree?
From pairwise distances between sequences, using algorithms such as UPGMA, Neighbor-Joining, etc. We discuss such algorithms later in the course.
ClustalW is a more advanced version, in which sequences/profiles are weighted.

10 Lifted Tree Alignments
Lifted tree alignment: each internal node is labeled by one of the labels of its daughters. Internal nodes are sequences and not profiles.
Example: [Figure: example tree; node labels in order of appearance: S_1, S_2, S_3, S_4, S_6, S_5, S_2, S_4, S_4, S_5]
We'll show:
1. A DP algorithm for optimal lifted tree alignment.
2. The optimal lifted alignment is a 2-approximation of the optimal tree alignment.

11 Lifted Tree Alignments Algorithm
Input: X, a set of sequences; T, a phylogenetic tree on X (leaves labeled by X).
Output: lifted labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal.
Basic principle: calculate, for every node v in T and every sequence S in X, d(v,S): the optimal cost of v's subtree when v is labeled by S.
The cost of the optimal tree is the minimum of d(r,S) over all S in X, where r is the root of T.

12 Lifted Tree Alignments Algorithm
d(v,S): the optimal cost of v's subtree when v is labeled by S.
Initialization: for a leaf v labeled S_v, d(v,S_v) = 0.
Recurrence: for an internal node v with daughters u_1, …, u_l, d(v,S) = sum over i of min over S' of [ D(S,S') + d(u_i,S') ], where S' ranges over the leaf labels in the subtree of u_i and the daughter whose subtree contains S keeps the label S.
Correctness: check the optimal-substructure property.
Complexity:
O(k^2) pairwise alignments: O(n^2 k^2).
k-1 iterations of the recurrence. For an internal node v: O(k_v^2) work, where k_v is the number of leaves in the subtree of v. Over all nodes: O(k^2 depth(T)) = O(k^3).
Total: O(k^2 (n^2 + depth(T))).
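The DP over d(v,S) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tutorial's own code: the tree is given as nested tuples whose leaves are the sequences themselves, the pairwise distance D is replaced by Hamming distance on equal-length strings purely to keep the example short, and the lifted constraint is enforced by letting the daughter whose subtree contains S keep the label S. The names `hamming` and `lifted_costs` are hypothetical.

```python
def hamming(s, t):
    """Stand-in for D: number of mismatching positions (equal-length strings only)."""
    if len(s) != len(t):
        raise ValueError("hamming() needs equal-length strings")
    return sum(a != b for a, b in zip(s, t))

def lifted_costs(node, dist=hamming):
    """Return {S: d(node, S)} for every leaf label S in the subtree of node."""
    if isinstance(node, str):  # leaf: its own label costs nothing
        return {node: 0}
    tables = [lifted_costs(child, dist) for child in node]
    costs = {}
    for j, own in enumerate(tables):
        for s, cost_s in own.items():
            total = cost_s  # daughter j keeps label s, so the edge (node, u_j) costs 0
            for i, other in enumerate(tables):
                if i != j:
                    # best label for daughter i, given that node is labeled s
                    total += min(dist(s, t) + c for t, c in other.items())
            costs[s] = min(total, costs.get(s, float("inf")))
    return costs

# Hypothetical tree: ((AGGTC, GTTCG), TGAAC)
tree = (("AGGTC", "GTTCG"), "TGAAC")
root_costs = lifted_costs(tree)
best = min(root_costs.values())
```

With Hamming distances 5, 3 and 5 between the three sequences, the root table is {AGGTC: 8, GTTCG: 10, TGAAC: 8}, so the optimal lifted cost in this toy case is 8. Recovering the labels themselves would need back-pointers, which the sketch omits.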

13 Lifted Tree Alignments Approximation analysis
Claim: the optimal LTA 2-approximates general tree alignments.
We'll show the construction of an LTA that costs at most twice the optimal TA with sequence-labeled nodes (can this be generalized to profile-labeled nodes?).
Notations:
T*: the optimal TA labels
S_v*: the label of node v in T*
T^L: our constructed LTA
S_v^L (or simply S_v): the label of node v in T^L

14 Lifted Tree Alignments Approximation analysis
Construction: we label the nodes bottom-up. For a node v with daughters u_1, …, u_l, we choose the label (from S_u1^L, …, S_ul^L) closest to S_v*.
We need to show: D(T^L) ≤ 2D(T*)

15 Lifted Tree Alignments Approximation analysis
Analysis:
Some edges in T^L have cost 0.
Consider an edge (v,u) of cost > 0, where v is the parent of u. Let P(v,u) be the path in T* from v to the leaf labeled by S_u.
D(S_v,S_u) ≤ D(S_v,S_v*) + D(S_u,S_v*)   (triangle inequality)
           ≤ 2D(S_u,S_v*)                 (choice of S_v)
           ≤ 2D(P(v,u))                   (triangle inequality)
Hence D(S_v,S_u) ≤ 2D(P(v,u)).
If (v,u) and (v',u') are two different edges with cost > 0 in T^L, then P(v,u) and P(v',u') are edge-disjoint.
Summing over all edges of cost > 0 therefore gives D(T^L) ≤ 2D(T*). Q.E.D.

16 Lifted Tree Alignments Approximation analysis
Final remarks:
The lifted tree alignment T^L is only conceptual (we don't have T*).
The optimal LTA cannot cost more than T^L.
In the case of profile-labeled nodes, the construction and analysis still hold as long as the cost is a distance function.

