Presentation is loading. Please wait.

Presentation is loading. Please wait.

Multiple Sequence Alignments Algorithms. MLAGAN: progressive alignment of DNA Given N sequences, phylogenetic tree Align pairwise, in order of the tree.

Similar presentations


Presentation on theme: "Multiple Sequence Alignments Algorithms. MLAGAN: progressive alignment of DNA Given N sequences, phylogenetic tree Align pairwise, in order of the tree."— Presentation transcript:

1 Multiple Sequence Alignments Algorithms

2 MLAGAN: progressive alignment of DNA Given N sequences, phylogenetic tree Align pairwise, in order of the tree (LAGAN) Human Baboon Mouse Rat

3 MLAGAN: main steps Given a collection of sequences, and a phylogenetic tree 1.Find local alignments for every pair of sequences x, y 2.Find anchors between every pair of sequences, similar to LAGAN anchoring 3.Progressive alignment Multi-Anchoring based on reconciling the pairwise anchors LAGAN-style limited-area DP 4.Optional refinement steps

4 MLAGAN: multi-anchoring X Z Y Z X/Y Z To anchor the (X/Y), and (Z) alignments:

5 Whole-genome alignment Human/Mouse/Rat

6 Insertion/Deletion Rate Analysis

7 Heuristics to improve multiple alignments Iterative refinement schemes A*-based search Consistency Simulated Annealing …

8 Iterative Refinement One problem of progressive alignment: Initial alignments are “frozen” even when new evidence comes Example: x:GAAGTT y:GAC-TT z:GAACTG w:GTACTG Frozen! Now clear correct y = GA-CTT

9 Iterative Refinement Algorithm (Barton-Stenberg): 1.Align most similar x i, x j 2.Align x k most similar to (x i x j ) 3.Repeat 2 until (x 1 …x N ) are aligned 4.For j = 1 to N, Remove x j, and realign to x 1 …x j-1 x j+1 …x N 5.Repeat 4 until convergence Note: Guaranteed to converge

10 Iterative Refinement For each sequence y 1.Remove y 2.Realign y (while rest fixed) x y z x,z fixed projection allow y to vary

11 Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x:GAAGTTA y:GAC-TTA z:GAACTGA w:GTACTGA After realigning y: x:GAAGTTA y:G-ACTTA + 3 matches z:GAACTGA w:GTACTGA

12 Iterative Refinement Example not handled well: x:GAAGTTA y 1 :GAC-TTA y 2 :GAC-TTA y 3 :GAC-TTA z:GAACTGA w:GTACTGA Realigning any single y i changes nothing

13 Restricted MDP Run MDP, restricted to radius R from m x y z Running Time: O(2 N R N-1 L)

14 Tree Refinement Run 3D-DP, restricted to radius R from m, for each tree node x y z Running Time: ~7R 2 LN R: Radius L: Alignment Length N: Number of Sequences

15 A* for Multiple Alignments Review of the A* algorithm v START GOAL g(v) h(v) g(v) is the cost so far h(v) is an estimate of the minimum cost from v to GOAL f(v) ≥ g(v) + h(v) is the minimum cost of a path passing by v 1. Expand v with the smallest f(v) 2. Never expand v, if f(v) ≥ shortest path to the goal found so far

16 A* for Multiple Alignments Nodes: Cells in the DP matrix g(v): alignment cost so far h(v): sum-of-pairs of individual pairwise alignments Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments v START GOAL g(v) h(v)

17 Consistency – T-Coffee z x y xixi yjyj y j’ zkzk

18

19 T – Coffee Layout LALIGNCLUSTALW

20 Generating Primary Library A AA B BB A C B BB C CC ClustalW Primary Library Lalign Primary Library (10 top scoring non–intersecting Local (Global pairwise alignment)(Pairwise alignment) Library has information for each N(N-1)/2 sequence pairs.

21 Primary Library Seq A GARFIELD THE LAST FAT CAT Seq B GARFIELD THE FAST CAT - - - Prim. Weight = 88 Seq A GARFIELD THE LAST FA-T CAT Seq C GARFIELD THE VERY FAST --- Prim. Weight = 77 Seq A GARFIELD THE LAST FAT CAT Seq D - - - -- - -- THE - - - - FAT CAT Prim. Weight = 100 Seq B GARFIELD THE - - - - FAST CAT Seq C GARFIELD THE VERY FAST CAT Prim. Weight = 100 Seq B GARFIELD THE FAST CAT Seq D - - - - -- - THE FA-T CAT Prim. Weight = 100 Seq C GARFIELD THE VERY FAST CAT Seq D - - - -- -- - THE - - - - FA-T CAT Prim. Weight = 100

22 Combining the libraries Seq A GARFIELD THE LAST FAT CAT Seq B GARFIELD THE FAST CAT - - - Primary weight(ClustalW)=88 Primary Weight(Lalign)=88 W(A(G),B(G)) = 88 + 88 = 176 If a pair is duplicated across the two libraries, it is merged into single entry with weight = sum of two weights pairs of residue that did not occur are not present ( weight 0 )

23 Library Extension Complete extension requires examination of all triplets. Not all bring information ( eg. A and B through D ). Weight of a pair =  weights gathered through examination of all triplets involving that pair.

24 Running Time Complexity of entire procedure: O(N 2 * L 2 ) + O(N 3 *L) + O(N 3 ) + O(N*L 2 ) O(N 2 L 2 ) - pair-wise library computation O(N 3 L) - library extension O(N 3 ) - computation of NJ tree O(N L 2 ) - progressive alignment computation Where: L – average sequence length N – number of sequences

25 T-Coffee compared with other methods MethodCat1(81)Cat2(23)Cat3(4)Cat4(12)Cat5(11)Total(141) Dialign71.025.235.174.780.461.5 ClustalW78.532.242.565.774.366.4 Prrp78.632.550.251.182.766.4 T-Coffee80.737.352.983.288.772.1

26 Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov

27 Reading GENSCAN EasyGene SLAM Twinscan Optional: Chris Burge’s Thesis

28 Gene expression Protein RNA DNA transcription translation CCTGAGCCAACTATTGATGAA PEPTIDEPEPTIDE CCUGAGCCAACUAUUGAUGAA

29 Gene structure exon1 exon2exon3 intron1intron2 transcription translation splicing exon = coding intron = non-coding

30 Finding genes Start codon ATG 5’ 3’ Exon 1 Exon 2 Exon 3 Intron 1Intron 2 Stop codon TAG/TGA/TAA Splice sites

31

32

33 Approaches to gene finding Homology  BLAST, Procrustes. Ab initio  Genscan, Genie, GeneID. Hybrids  GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.

34 HMMs for single species gene finding: Generalized HMMs

35 HMMs for gene finding GTCAGAGTAGCAAAGTAGACACTCCAGTAACGC exon intron intergene

36 GHMM for gene finding TAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTGGGGGGGGGGGGGGGCCCCCCC Exon1Exon2Exon3 duration

37 Observed duration times

38 Better way to do it: negative binomial EasyGene: Prokaryotic gene-finder Larsen TS, Krogh A Negative binomial with n = 3

39 Biology of Splicing (http://genes.mit.edu/chris/)

40 Consensus splice sites (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html) Donor: 7.9 bits Acceptor: 9.4 bits (Stephens & Schneider, 1996)

41 Splice site detection 5’ 3’ Donor site Position 

42 Splice Site Models WMM: weight matrix model = PSSM (Staden 1984) WAM: weight array model = 1 st order Markov (Zhang & Marr 1993) MDD: maximal dependence decomposition (Burge & Karlin 1997) decision-tree like algorithm to take significant pairwise dependencies into account

43 atg tga ggtgag caggtg cagatg cagttg caggcc ggtgag


Download ppt "Multiple Sequence Alignments Algorithms. MLAGAN: progressive alignment of DNA Given N sequences, phylogenetic tree Align pairwise, in order of the tree."

Similar presentations


Ads by Google