Rapid Global Alignments How to align genomic sequences in (more or less) linear time.

1 Rapid Global Alignments How to align genomic sequences in (more or less) linear time

2 Methods to CHAIN Local Alignments Sparse Dynamic Programming O(N log N)

3 The Problem: Find a Chain of Local Alignments. A step (x,y) → (x’,y’) in the chain requires x < x’ and y < y’. Each local alignment has a weight. FIND the chain with the highest total weight.

4 Sparse DP for rectangle chaining
1, …, N: rectangles
($h_j$, $l_j$): y-coordinates of rectangle j
w(j): weight of rectangle j
V(j): optimal score of a chain ending in j
L: list of triplets ($l_j$, V(j), j); L is sorted by $l_j$ and is implemented as a balanced binary tree

5 Sparse DP for rectangle chaining
Main idea: sweep through the x-coordinates.
To the right of b, anything chainable to a is also chainable to b. Therefore, if V(b) > V(a), rectangle a is “useless”: remove it.
In L, keep the rectangles j sorted by increasing $l_j$-coordinate; they are then also sorted by increasing V(j).

6 Sparse DP for rectangle chaining
Go through the rectangle x-coordinates, from left to right:
1. When on the leftmost end of rectangle i, compute V(i):
   a. j: rectangle in L with the largest $l_j < h_i$
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i, possibly store V(i) in L:
   a. j: rectangle in L with the largest $l_j \le l_i$
   b. If V(i) > V(j):
      i. INSERT ($l_i$, V(i), i) in L
      ii. REMOVE all ($l_k$, V(k), k) with $V(k) \le V(i)$ and $l_k \ge l_i$
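The sweep on this slide can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the deck's implementation: the name `chain_rectangles` and the tuple layout `(x_lo, x_hi, h, l, w)` are hypothetical, `h` is taken to be the y-coordinate where a rectangle starts and `l` the one where it ends (`h <= l`), weights are assumed positive, and V(i) = w(i) when no chainable j exists. A plain Python list with `bisect` stands in for the balanced binary tree, so insertions here cost O(N) rather than O(log N).

```python
from bisect import bisect_left, bisect_right


def chain_rectangles(rects):
    """rects: list of (x_lo, x_hi, h, l, w) with x_lo <= x_hi, h <= l, w > 0.

    Returns (best_score, indices_of_best_chain).
    """
    if not rects:
        return 0, []
    events = []
    for i, (x_lo, x_hi, _h, _l, _w) in enumerate(rects):
        events.append((x_lo, 0, i))  # 0 = left end, handled first on x ties
        events.append((x_hi, 1, i))  # 1 = right end
    events.sort()

    ls, ids = [], []  # the list L: l-values (sorted) and the matching rectangle ids
    V = [0] * len(rects)
    prev = [None] * len(rects)

    for _x, kind, i in events:
        _, _, h, l, w = rects[i]
        if kind == 0:
            # Step 1: rectangle j in L with the largest l_j < h_i.
            p = bisect_left(ls, h)
            if p > 0:
                prev[i] = ids[p - 1]
                V[i] = w + V[prev[i]]
            else:
                V[i] = w  # assumption: nothing to chain to, start a new chain
        else:
            # Step 2: rectangle j in L with the largest l_j <= l_i.
            p = bisect_right(ls, l)
            if p == 0 or V[i] > V[ids[p - 1]]:
                # Remove dominated entries (l_k >= l_i and V(k) <= V(i));
                # they are consecutive because V increases along L.
                q = bisect_left(ls, l)
                end = q
                while end < len(ls) and V[ids[end]] <= V[i]:
                    end += 1
                del ls[q:end]
                del ids[q:end]
                ls.insert(q, l)
                ids.insert(q, i)

    best = max(range(len(rects)), key=V.__getitem__)
    chain, i = [], best
    while i is not None:
        chain.append(i)
        i = prev[i]
    return V[best], chain[::-1]
```

Processing left ends before right ends at equal x keeps the chaining condition strict (x < x’), which is a design choice of this sketch rather than something the slide specifies.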

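7 Example [Figure: five rectangles in the x-y plane, labeled with their weights as 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; coordinate ticks at 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]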

8 Time Analysis
1. Sorting the x-coordinates takes O(N log N).
2. Going through the x-coordinates: N steps.
3. Each of the N steps requires O(log N) time:
   Searching L takes log N.
   Inserting into L takes log N.
   All deletions are consecutive, so log N per deletion.
   Each element is deleted at most once: N log N for all deletions.
Recall that INSERT, DELETE, and SUCCESSOR take O(log N) time in a balanced binary search tree.

9 Putting it All Together: Fast Global Alignment Algorithms
1. FIND local alignments
2. CHAIN local alignments
          FIND          CHAIN
GLASS:    k-mers        hierarchical DP
MUMmer:   suffix tree   sparse DP
Avid:     suffix tree   hierarchical DP
LAGAN:    CHAOS         sparse DP

10 LAGAN: Pairwise Alignment 1. FIND local alignments 2. CHAIN local alignments 3. DP restricted around the chain
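Step 3 restricts the dynamic program to a region near the chain. As a simplified illustration only, the sketch below uses a fixed-width band around the main diagonal as a stand-in for that region; this is not LAGAN's actual region construction. The function name `banded_global_score` and the scoring values (match +1, mismatch -1, gap -2) are assumptions.

```python
def banded_global_score(x, y, band, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells (i, j) with |i - j| <= band."""
    n, m = len(x), len(y)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the end cell")
    neg_inf = float("-inf")
    # For clarity the full matrix is allocated; a memory-saving version
    # would store only the cells inside the band.
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = max(best, F[i - 1][j - 1] + s)
            if i > 0:
                best = max(best, F[i - 1][j] + gap)  # gap in y
            if j > 0:
                best = max(best, F[i][j - 1] + gap)  # gap in x
            F[i][j] = best
    return F[n][m]
```

Only O(band · n) cells are filled, which is the source of the speed-up; cells outside the band stay at minus infinity and can never be part of the reported alignment.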

11 LAGAN 1. Find local alignments 2. Chain: O(N log N) L.I.S. (longest increasing subsequence) 3. Restricted DP
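The O(N log N) L.I.S. step can be illustrated with the classic patience-sorting algorithm for the longest increasing subsequence. The sketch assumes the simplest setting, which the slide does not spell out: anchors are points (x, y) with equal weight, and a chain must increase strictly in both coordinates. The function name `chain_anchors` is hypothetical.

```python
from bisect import bisect_left


def chain_anchors(anchors):
    """Longest chain of (x, y) points that is strictly increasing in x and y."""
    # Sort by x ascending and y descending, so that two anchors sharing an x
    # can never both appear in a strictly increasing run of y values.
    pts = sorted(anchors, key=lambda p: (p[0], -p[1]))
    tails = []      # tails[k]: smallest final y of an increasing run of length k + 1
    tails_idx = []  # index into pts of the anchor that ends that run
    prev = [None] * len(pts)
    for i, (_x, y) in enumerate(pts):
        k = bisect_left(tails, y)  # O(log N) binary search
        if k == len(tails):
            tails.append(y)
            tails_idx.append(i)
        else:
            tails[k] = y
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else None
    # Trace back from the end of the longest run.
    chain = []
    i = tails_idx[-1] if tails_idx else None
    while i is not None:
        chain.append(pts[i])
        i = prev[i]
    return chain[::-1]
```

With weighted anchors or rectangles rather than points, the sparse DP with a sorted list of (l, V) pairs is the more general tool; the L.I.S. view covers the unit-weight case.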

12 LAGAN: recursive call What if a box is too large? → Recursive application of LAGAN, with a more sensitive word search

13 A trick to save on memory “necks” have tiny tracebacks …only store tracebacks

14 Multiple Sequence Alignments

17 Overview Definition Scoring Schemes Algorithms

18 Definition
Given N sequences $x_1, x_2, \ldots, x_N$: insert gaps (-) in each sequence $x_i$, such that
  all sequences have the same length L, and
  the score of the global map is maximum.
A faint similarity between two sequences becomes significant if it is present in many.
Multiple alignments can help improve the pairwise alignments.

19 Scoring Function Ideally: find the alignment that maximizes the probability that the sequences evolved from a common ancestor, according to some phylogenetic model. More on phylogenetic models later. [Figure: tree with nodes labeled x, y, z, w, v and ?]

20 Scoring Function
A comprehensive model would have too many parameters and would be too inefficient to optimize.
Possible simplifications:
  Ignore the phylogenetic tree.
  Treat columns as statistically independent: $S(m) = G(m) + \sum_i S(m_i)$
m: alignment matrix
G: function penalizing gaps

21 Scoring Function: Sum Of Pairs
Definition: induced pairwise alignment: the pairwise alignment of two sequences induced by the multiple alignment (columns in which both sequences have a gap are dropped).
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
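A short sketch of how an induced pairwise alignment can be computed from two rows of a multiple alignment: keep every column except those where both rows have a gap. The function name `induced_pairwise` is an illustrative choice, not taken from the slides.

```python
def induced_pairwise(row_a, row_b, gap="-"):
    """Project a multiple alignment onto two of its rows."""
    # Drop the columns in which both rows contain a gap.
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))  # ('ACGCGG-C', 'ACGC-GAG')
print(induced_pairwise(x, z))  # ('AC-GCGG-C', 'GCCGC-GAG')
print(induced_pairwise(y, z))  # ('AC-GCGAG', 'GCCGCGAG')
```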

22 Sum Of Pairs (cont’d) The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments: $S(m) = \sum_{k<l} s(m_k, m_l)$, where $s(m_k, m_l)$ is the score of the induced alignment (k, l).

23 Sum Of Pairs (cont’d) Heuristic way to incorporate the evolutionary tree: weighted SOP, $S(m) = \sum_{k<l} w_{kl}\, s(m_k, m_l)$, where $w_{kl}$ is a weight decreasing with distance. [Figure: tree over Human, Mouse, Chicken, Duck]
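The sum-of-pairs score, with optional pair weights, can be sketched as below. The pairwise scoring scheme (match +1, mismatch -1, gap -2 per gapped column, gap-gap columns ignored) and the names `pair_score` and `sp_score` are assumptions made for illustration; the slides leave s and the weights unspecified.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score of the pairwise alignment induced by two rows."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # column dropped in the induced alignment
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def sp_score(rows, weights=None):
    """Sum over all pairs k < l of w_kl * s(m_k, m_l); w_kl defaults to 1."""
    total = 0
    for k, l in combinations(range(len(rows)), 2):
        w = 1 if weights is None else weights[(k, l)]
        total += w * pair_score(rows[k], rows[l])
    return total


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(sp_score(rows))  # 0 + (-4) + 3 = -1 under the assumed scoring
```

Passing `weights={(0, 1): 1, (0, 2): 0.5, (1, 2): 1}` down-weights the (x, z) pair, which is the kind of adjustment the weighted variant makes for distant sequences.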

24 Consensus
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Find the optimal consensus string $m^*$ to maximize $S(m) = \sum_i s(m^*, m_i)$
$s(m_k, m_l)$: score of pairwise alignment (k, l)

25 Multiple Sequence Alignments Algorithms

26 1. Multidimensional Dynamic Programming
Generalization of Needleman-Wunsch:
$S(m) = \sum_i S(m_i)$ (sum of column scores)
$F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$, with the maximum taken over all neighbors of the cube

27 1. Multidimensional Dynamic Programming
Example: in 3D (three sequences), 7 neighbors/cell:
$$F(i,j,k) = \max \begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k) \\
F(i-1,j-1,k) + S(x_i, x_j, -) \\
F(i-1,j,k-1) + S(x_i, -, x_k) \\
F(i-1,j,k) + S(x_i, -, -) \\
F(i,j-1,k-1) + S(-, x_j, x_k) \\
F(i,j-1,k) + S(-, x_j, -) \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}$$
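The three-sequence recurrence can be written out directly. This is a sketch under assumptions: the column score S is taken to be a sum-of-pairs over the three symbols (match +1, mismatch -1, residue-gap -2, gap-gap 0), which the slide does not fix, and the names `column_score` and `align3_score` are hypothetical. Only the optimal score is returned, not the alignment.

```python
from itertools import product


def column_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Assumed column score: sum of pairwise scores within one column."""
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        if p == "-" or q == "-":
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total


def align3_score(x, y, z):
    """Optimal score of a global alignment of three sequences."""
    # The 7 neighbors: every 0/1 step vector except (0, 0, 0).
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    neg_inf = float("-inf")
    F = [[[neg_inf] * (len(z) + 1) for _ in range(len(y) + 1)]
         for _ in range(len(x) + 1)]
    F[0][0][0] = 0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    # A step of 1 consumes a residue; a step of 0 emits a gap.
                    a = x[i - 1] if di else "-"
                    b = y[j - 1] if dj else "-"
                    c = z[k - 1] if dk else "-"
                    cand = F[pi][pj][pk] + column_score(a, b, c)
                    if cand > F[i][j][k]:
                        F[i][j][k] = cand
    return F[len(x)][len(y)][len(z)]
```

Generating the moves with `product((0, 1), repeat=3)` makes the 2^N - 1 neighbor count explicit and extends to more sequences by changing `repeat`, at the exponential cost the slides go on to describe.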

28 1. Multidimensional Dynamic Programming
Running Time:
1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$
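To get a feel for how quickly this bound grows, the product of matrix size and neighbors per cell can be evaluated for a few small cases. The helper name `mdp_cell_updates` is hypothetical, and the count follows the slide's $L^N$ simplification rather than the exact $(L+1)^N$ table size.

```python
def mdp_cell_updates(length, num_seqs):
    """Approximate work for full multidimensional DP: L^N cells times 2^N - 1 neighbors."""
    return length ** num_seqs * (2 ** num_seqs - 1)


for n in (2, 3, 4, 5):
    print(n, mdp_cell_updates(100, n))
```

For sequences of length 100, three sequences already need 7,000,000 neighbor evaluations, and each additional sequence multiplies the work by roughly 200.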

29 2. Progressive Alignment
Multiple alignment is NP-complete.
Most used heuristic: progressive alignment.
Algorithm:
1. Align two of the sequences $x_i$, $x_j$
2. Fix that alignment
3. Align a third sequence $x_k$ to the alignment $x_i, x_j$
4. Repeat until all sequences are aligned
Running Time: $O(N L^2)$

30 2. Progressive Alignment When the evolutionary tree is known: align the closest sequences first, in the order of the tree. Example: order of alignments: 1. (x,y) 2. (z,w) 3. (xy, zw) [Figure: tree with leaves x, y, z, w]
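One way to derive the order of alignments from a known tree is a post-order traversal, which always merges the children of a node before the node itself. The sketch assumes the tree is given as nested 2-tuples of sequence names; that representation and the function name `alignment_order` are illustrative choices.

```python
def alignment_order(tree):
    """Return the list of (left, right) merges in the order they are performed."""
    order = []

    def visit(node):
        if isinstance(node, str):  # a leaf: a single sequence name
            return node
        left, right = node
        a = visit(left)
        b = visit(right)
        order.append((a, b))       # align the two child alignments
        return a + b               # name of the merged alignment

    visit(tree)
    return order


print(alignment_order((("x", "y"), ("z", "w"))))
# [('x', 'y'), ('z', 'w'), ('xy', 'zw')]
```

A post-order guarantees only that subtrees are aligned before their parent; choosing which subtree to handle first by evolutionary distance would additionally need branch lengths, which this sketch does not model.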

31 CLUSTALW: progressive alignment
CLUSTALW: most popular multiple protein alignment
Algorithm:
1. Find all $d_{ij}$: alignment distance ($x_i$, $x_j$)
2. Construct a tree (neighbor-joining hierarchical clustering)
3. Align nodes in order of decreasing similarity
+ a large number of heuristics

32 CLUSTALW & the CINEMA viewer

33 MLAGAN: progressive alignment of DNA Given N sequences and a phylogenetic tree: align pairwise, in the order of the tree (LAGAN). [Figure: tree over Human, Baboon, Mouse, Rat]

34 MLAGAN: main steps
Given a collection of sequences and a phylogenetic tree:
1. Find local alignments for every pair of sequences x, y
2. Find anchors between every pair of sequences, similar to LAGAN anchoring
3. Progressive alignment
   Multi-anchoring based on reconciling the pairwise anchors
   LAGAN-style limited-area DP
4. Optional refinement steps

35 MLAGAN: multi-anchoring To anchor the (X/Y) and (Z) alignments: [Figure: three panels labeled X vs. Z, Y vs. Z, and X/Y vs. Z]

36 Heuristics to improve multiple alignments
Iterative refinement schemes
A*-based search
Consistency
Simulated annealing
…

37 Iterative Refinement
One problem of progressive alignment: initial alignments are “frozen” even when new evidence comes.
Example:
x: GAAGTT
y: GAC-TT
z: GAACTG
w: GTACTG
Frozen! It is now clear that the correct y = GA-CTT

38 Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align the most similar $x_i$, $x_j$
2. Align the $x_k$ most similar to ($x_i x_j$)
3. Repeat 2 until ($x_1 \ldots x_N$) are aligned
4. For j = 1 to N: remove $x_j$, and realign it to $x_1 \ldots x_{j-1} x_{j+1} \ldots x_N$
5. Repeat 4 until convergence
Note: guaranteed to converge

39 Iterative Refinement
For each sequence y:
1. Remove y
2. Realign y (while the rest is fixed)
[Figure: alignment of x, y, z; the x,z projection is fixed, y is allowed to vary]

40 Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA   (+3 matches)
z: GAACTGA
w: GTACTGA
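The “+3 matches” figure can be checked by counting, for each placement of y, the columns in which y agrees with each of the other rows. The helper name `matches_against` is an assumption made for this check; it counts identical non-gap symbols only and is not a full alignment score.

```python
def matches_against(row, others, gap="-"):
    """Number of columns where `row` has the same residue as another row, summed over rows."""
    return sum(
        1
        for other in others
        for a, b in zip(row, other)
        if a == b and a != gap
    )


others = ["GAAGTTA", "GAACTGA", "GTACTGA"]  # x, z, w
before = matches_against("GAC-TTA", others)
after = matches_against("G-ACTTA", others)
print(before, after, after - before)  # 12 15 3
```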

41 Iterative Refinement
Example not handled well:
x: GAAGTTA
$y_1$: GAC-TTA
$y_2$: GAC-TTA
$y_3$: GAC-TTA
z: GAACTGA
w: GTACTGA
Realigning any single $y_i$ changes nothing

42 Restricted MDP Here is another way to improve a multiple alignment: 1. Construct a progressive multiple alignment m 2. Run MDP, restricted to radius R from m. Running Time: $O(2^N R^{N-1} L)$

43 Restricted MDP Run MDP, restricted to radius R from m. Running Time: $O(2^N R^{N-1} L)$ [Figure: region of radius R around m, with axes x, y, z]

44 Restricted MDP
x: GAAGTTA
$y_1$: GAC-TTA
$y_2$: GAC-TTA
$y_3$: GAC-TTA
z: GAACTGA
w: GTACTGA
Within radius 1 of the optimal → Restricted MDP will fix it.

45 Optional refinement steps in MLAGAN
Limited-area iterative refinement
Radius-r 3-sequence refinement on each node of the tree

