Annotation and Alignment of the Drosophila Genomes.

1 Annotation and Alignment of the Drosophila Genomes

2 One (possibly wrong) alignment is not enough: the history of parametric inference
1992: Waterman, M., Eggert, M. & Lander, E. Parametric sequence comparisons, Proc. Natl. Acad. Sci. USA 89, 6090-6093.
1994: Gusfield, D., Balasubramanian, K. & Naor, D. Parametric optimization of sequence alignment, Algorithmica 12, 312-326.
2003: Wang, L. & Zhao, J. Parametric alignment of ordered trees, Bioinformatics 19, 2237-2245.
2004: Fernández-Baca, D., Seppäläinen, T. & Slutzki, G. Parametric Multiple Sequence Alignment and Phylogeny Construction, Journal of Discrete Algorithms 2, 271-287.
XPARAL by Kristian Stevens and Dan Gusfield

3 Whole Genome Parametric Alignment
Colin Dewey, Peter Huggins, Lior Pachter, Bernd Sturmfels and Kevin Woods
Mathematics and Computer Science
  - Parametric alignment in higher dimensions.
  - Faster new algorithms.
  - Deeper understanding of alignment polytopes.
Biology
  - Whole genome parametric alignment.
  - Biological implications of alignment parameters.
  - Alignment with biology rather than for biology.

4 Whole Genome Parametric Alignment
CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG
CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA-------
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----
CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-
CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA-------
CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT
analysis

6 computational geometry

8 A Whole Genome Parametric Alignment of D. melanogaster and D. pseudoobscura
  - Divided the genomes into 1,116,792 constrained and 877,982 unconstrained segment pairs. This is an orthology map of the two genomes.
  - 2D, 3D, 4D, and 5D alignment polytopes were constructed for each of the 877,802 unconstrained segment pairs. For each segment pair, this gives all possible optimal summaries for all parameters in a Needleman-Wunsch scoring scheme.
  - Computed the Minkowski sum of the 877,802 2D polytopes. There are only 838 optimal alignments of the two Drosophila genomes if the same match, mismatch and gap parameters are used for all the segment pair alignments.
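The Minkowski sum step can be illustrated on a toy scale. The sketch below is not the authors' implementation: it assumes each 2D polytope is given as a list of integer vertex pairs and computes the sum naively as the convex hull of all pairwise vertex sums. The function names (`convex_hull`, `minkowski_sum`, `minkowski_sum_all`) are hypothetical, and the quadratic pairwise method is chosen for clarity rather than for the speed a genome-scale computation would need.

```python
from functools import reduce

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Vertices of the convex hull of 2D points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def minkowski_sum(p, q):
    """Minkowski sum of two convex polygons: hull of all pairwise vertex sums."""
    return convex_hull([(a[0] + b[0], a[1] + b[1]) for a in p for b in q])

def minkowski_sum_all(polygons):
    """Fold the pairwise sum over a list of polygons."""
    return reduce(minkowski_sum, polygons)

# Toy input: two segments and their sum, then a third polygon added in.
horizontal = [(0, 0), (1, 0)]
vertical = [(0, 0), (0, 1)]
diagonal = [(0, 0), (1, 1)]
total = minkowski_sum_all([horizontal, vertical, diagonal])
print(len(total), "vertices:", total)
```

Under this reading, each vertex of the summed polytope stands for one summary that is optimal for some shared parameter choice, which is how a count such as the 838 quoted on the slide would arise.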

14
>mel
CTGCGGGATTAGGGGTCATTAGAGTGCCGA
AAAGCGAGTTTATTCTATGGAC
>pse
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGA
GGAGAGGCCATCATCGTGTAC
How do we build the polytope for this pair of sequences?

15 Alignment polytopes are small

Theorem: The number of vertices of an alignment polytope for two sequences of lengths n and m is $O((n+m)^{d(d-1)/(d+1)})$, where d is the number of free parameters in the scoring scheme.

Examples:

Parameters             Model                                   Vertices
M, X, S                Jukes-Cantor with linear gap penalty    $O((n+m)^{2/3})$
M, X, S, G             Jukes-Cantor with affine gap penalty    $O((n+m)^{3/2})$
M, X_TS, X_TV, S, G    K2P with affine gap penalty             $O((n+m)^{12/5})$

L. Pachter and B. Sturmfels, Parametric inference for biological sequence analysis, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p 16138-16143.
L. Pachter and B. Sturmfels, Tropical geometry of statistical models, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p 16132-16137.
L. Pachter and B. Sturmfels (eds.), Algebraic Statistics for Computational Biology, Cambridge University Press.
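The exponents in the examples can be checked with a few lines of arithmetic. The sketch below assumes that d is one less than the number of listed parameters, which is the reading that reproduces the three exponents on the slide; the function name `vertex_exponent` is hypothetical.

```python
from fractions import Fraction

def vertex_exponent(d):
    """Exponent d(d-1)/(d+1) from the vertex bound, as an exact fraction."""
    return Fraction(d * (d - 1), d + 1)

# Assumption: d = (number of listed parameters) - 1 for each model.
models = {
    "M,X,S": 3,
    "M,X,S,G": 4,
    "M,X_TS,X_TV,S,G": 5,
}
for name, n_params in models.items():
    d = n_params - 1
    print(f"{name}: d = {d}, exponent = {vertex_exponent(d)}")
```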

16 The algebraic statistical model for sequence alignment, known as the pair hidden Markov model, is the image of the map [map not preserved in the transcript]. The logarithms of the parameters give the edge lengths for the shortest path problem on the alignment graph.
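The shortest path view can be sketched as a small dynamic program. This is an illustration under stated assumptions, not the model from the slide: it uses only three hypothetical parameters (`M` for match, `X` for mismatch, `S` for space), takes the negative logarithm of each as an edge length, and finds the shortest path from the top-left to the bottom-right corner of the alignment graph. The parameter values are made up for the example.

```python
import math

def shortest_alignment_path(a, b, theta):
    """Length of the shortest path through the alignment graph of a and b.

    theta maps 'M', 'X', 'S' to positive parameters (hypothetical values);
    each edge has length -log(parameter), so the shortest path corresponds
    to the alignment with the largest product of parameters.
    """
    w = {key: -math.log(value) for key, value in theta.items()}
    n, m = len(a), len(b)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * w["S"]          # only space edges along the border
    for j in range(1, m + 1):
        dist[0][j] = j * w["S"]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = w["M"] if a[i - 1] == b[j - 1] else w["X"]
            dist[i][j] = min(
                dist[i - 1][j] + w["S"],         # space in b
                dist[i][j - 1] + w["S"],         # space in a
                dist[i - 1][j - 1] + diagonal,   # match or mismatch
            )
    return dist[n][m]

theta = {"M": 0.9, "X": 0.05, "S": 0.05}  # illustrative values only
print(shortest_alignment_path("ACGT", "ACGT", theta))
```

Changing `theta` changes the edge lengths and therefore which path is shortest, which is the dependence that parametric alignment describes for all parameter values at once.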

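17 Newton Polytope of a Polynomial

Definition: The Newton polytope of a polynomial $f(x_1, \dots, x_d) = \sum_{a} c_a \, x_1^{a_1} \cdots x_d^{a_d}$ is defined to be the convex hull of the lattice points in $\mathbb{R}^d$ corresponding to monomials in f:

$NP(f) = \mathrm{conv}\{ (a_1, \dots, a_d) : c_a \neq 0 \}$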
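A two-variable example makes the definition concrete. The sketch below assumes a polynomial is stored as a dictionary from exponent pairs to coefficients (a representation chosen here for illustration) and returns the vertices of its Newton polytope with a standard monotone chain convex hull; `newton_polytope` is a hypothetical helper name.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Vertices of the convex hull of 2D points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def newton_polytope(poly):
    """Hull of the exponent vectors whose coefficients are nonzero."""
    return convex_hull([exp for exp, coeff in poly.items() if coeff != 0])

# f(x, y) = x^2 + xy + y^2 + x + y + 1, keyed by (power of x, power of y)
f = {(2, 0): 1, (1, 1): 1, (0, 2): 1, (1, 0): 1, (0, 1): 1, (0, 0): 1}
print(newton_polytope(f))
```

For this f the exponents (1,1), (1,0) and (0,1) lie on the edges of the triangle spanned by (0,0), (2,0) and (0,2), so only the three corners are vertices.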

18 Polytope propagation

$NP_{i,j} = S * NP_{i-1,j} + S * NP_{i,j-1} + (X \text{ or } M) * NP_{i-1,j-1}$

$NP_{i,j}$: Newton polytope for positions [1,i] and [1,j] in each sequence
+ : convex hull of union
* : Minkowski sum

Sequences on the alignment grid: AACATTAGA and AGATTACCACA
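The recursion can be sketched for the two-dimensional case. This is a minimal illustration, not the authors' software: it assumes each alignment is summarized by the pair (x, s) of mismatch and space counts, so that multiplying by S or X is a Minkowski sum with the single point (0, 1) or (1, 0), multiplying by M leaves the polytope unchanged, and addition is the convex hull of the union. The function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Vertices of the convex hull of 2D points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def shift(polytope, v):
    """Minkowski sum of a polytope with the single point v."""
    return [(p[0] + v[0], p[1] + v[1]) for p in polytope]

def alignment_polytope(a, b):
    """Vertices (mismatches, spaces) of the 2D alignment polytope of a and b."""
    n, m = len(a), len(b)
    # Row 0: the only path to (0, j) uses j spaces.
    prev = [[(0, j)] for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [[(0, i)]]  # column 0: i spaces
        for j in range(1, m + 1):
            step = (0, 0) if a[i - 1] == b[j - 1] else (1, 0)  # M or X
            candidates = (
                shift(prev[j], (0, 1))        # S * NP[i-1][j]
                + shift(cur[j - 1], (0, 1))   # S * NP[i][j-1]
                + shift(prev[j - 1], step)    # (X or M) * NP[i-1][j-1]
            )
            cur.append(convex_hull(candidates))  # "+" is hull of the union
        prev = cur
    return prev[m]

poly = alignment_polytope("AACATTAGA", "AGATTACCACA")
print(poly)
```

Only the hull vertices are stored at each cell, which is what keeps the tables small compared with enumerating every alignment.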

19 Back to Adf1
BP England, U Heberlein, R Tjian. Purified Drosophila transcription factor, Adh distal factor-1 (Adf-1), binds to sites in several Drosophila promoters and activates transcription. J Biol Chem, 1990.

20 Drosophila DNase I Footprint Database (v2.0)

Target Gene     Chromosome Arm  Start     Stop      Transcription Factor  Pubmed ID (PMID)  Footprint ID (FPID)  Footprint Alignment
ems (CG2988)    3R              9723806   9723816   Abd-B (CG11648)       9491376           003205               Abd-B->ems:003205
ems (CG2988)    3R              9723843   9723853   Abd-B (CG11648)       9491376           003206               Abd-B->ems:003206
ems (CG2988)    3R              9723998   9724008   Abd-B (CG11648)       9491376           003208               Abd-B->ems:003208
ems (CG2988)    3R              9724091   9724102   Abd-B (CG11648)       9491376           003209               Abd-B->ems:003209
ems (CG2988)    3R              9724526   9724536   Abd-B (CG11648)       9491376           003211               Abd-B->ems:003211
ems (CG2988)    3R              9724557   9724567   Abd-B (CG11648)       9491376           003213               Abd-B->ems:003213
ems (CG2988)    3R              9724614   9724624   Abd-B (CG11648)       9491376           003214               Abd-B->ems:003214
dpp (CG9885)    2L              2454657   2454685   Adf1 (CG15845)        7791801           003665               Adf1->dpp:003665
Adh (CG3481)    2L              14615472  14615509  Adf1 (CG15845)        2105454           005046               Adf1->Adh:005046
Ddc (CG10697)   2L              19116303  19116321  Adf1 (CG15845)        2318884           005464               Adf1->Ddc:005464
Antp (CG1028)   3R              2825018   2825059   Adf1 (CG15845)        2318884           006446               Adf1->Antp:006446
Adh (CG3481)    2L              14616171  14616209  Adf1 (CG15845)        2105454           005059               Adf1->Adh:005059
Antp (CG1028)   3R              2825117   2825144   Adf1 (CG15845)        2318884           006447               Adf1->Antp:006447
Antp (CG1028)   3R              2825151   2825174   Adf1 (CG15845)        2318884           006448               Adf1->Antp:006448

21 Back to Adf1
BLASTZ alignment
mel TGTGCGTCAGCGTCGGCCGCAACAGCG
pse TGT-----------------GACTGCG
    ***                  ** ***

22 Back to Adf1
mel TGTG----CGTCAGC--G----TCGGCC---GC-AACAG-CG
pse TGTGACTGCG-CTGCCTGGTCCTCGGCCACAGCCAAC-GTCG
    ****    ** * **  *    ******   ** *** * **

23 Back to Adf1
mel TGTGCGTCAGC------GTCGGCCGCAACAGCG
pse TGTGACTGCGCTGCCTGGTCCTCGGCCACAGC-
    ****  *  **      ***  * ** *****

25
Per site analysis
  Group 1 mean per site % identity           51.3%   51.3%   47.8%
  Group 2 mean per site % identity           47.8%   42.9%   42.9%
  Difference of means (group 1 - group 2)    3.6%    8.4%    4.9%
  Difference of means resampling p-value     0.05    0.003   1E-5
  Distribution comparison KS p-value         0.026   0.0016  2E-6
Per base analysis
  Group 1 mean per base % identity           47.8%   47.8%   46.3%
  Group 2 mean per base % identity           46.3%   42.4%   42.4%
  Difference of means (group 1 - group 2)    1.5%    5.4%    3.9%

26 Same table as slide 25, annotated with 80.4%

27 Same table as slide 25, annotated with 85.1%

28 Same table as slide 25, annotated with 86.5%

29 Same table as slide 25, annotated with 79.1%

30 Applications
  - Conservation of cis-regulatory elements
  - Phylogenetics: branch length estimation
This is the expected number of mutations per site in an alignment with summary (x,s).
Jukes-Cantor correction:
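The formula on this slide did not survive in the transcript. As a hedged illustration, the sketch below uses the standard Jukes-Cantor distance, -(3/4) ln(1 - (4/3) p), where p is the fraction of aligned sites that differ. How p is derived from a summary (x, s) is an assumption made here: the helper `mismatch_fraction` takes the number of aligned columns to be (n + m - s) / 2 for sequences of lengths n and m, and both function names are hypothetical.

```python
import math

def jukes_cantor_distance(p):
    """Standard Jukes-Cantor estimate of substitutions per site.

    p is the observed fraction of aligned sites that differ; the estimate
    is only defined for 0 <= p < 3/4.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

def mismatch_fraction(x, s, n, m):
    """Assumed mapping from a summary (x, s) to p.

    Each match or mismatch column uses one character from each sequence and
    each space uses one character in total, so there are (n + m - s) / 2
    aligned columns, of which x are mismatches.
    """
    aligned_columns = (n + m - s) / 2
    return x / aligned_columns

# Illustrative numbers only: 3 mismatches and 4 spaces for two length-12 sequences.
p = mismatch_fraction(3, 4, 12, 12)
print(p, jukes_cantor_distance(p))
```

Evaluating this at each vertex of an alignment polytope would show how much a branch length estimate moves as the alignment parameters change.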

31 Applications
  - Conservation of cis-regulatory elements
  - Phylogenetics: branch length estimation

