Robust Alignment of Drosophila Genomes Lior Pachter EECS Joint Colloquium, October 5th 2005.

1 Robust Alignment of Drosophila Genomes Lior Pachter EECS Joint Colloquium, October 5th 2005

2 GGGCTGGGCGAGTATCTCTTCGAAAGGCTCACTCTCAAGCACGACTAAGAGCCTTCTGAGC...... What is genomics?...

3 GGGCTGGGCGAGTATCTCTTCGAAAGGCTCACTCTCAAGCACGACTAAGAGCCTTCTGAGC...... What is genomics? GLGEYLFERLTLKHD*.....
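The protein string on this slide appears to be the DNA read three bases at a time with the standard genetic code. A minimal sketch of that translation follows; the names `CODON_TABLE` and `translate` are illustrative and not from the talk, and reading frame 0 is an assumption.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate DNA in reading frame 0, stopping after the first stop codon (*)."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3]]
        protein.append(residue)
        if residue == "*":
            break
    return "".join(protein)


dna = "GGGCTGGGCGAGTATCTCTTCGAAAGGCTCACTCTCAAGCACGACTAAGAGCCTTCTGAGC"
print(translate(dna))  # GLGEYLFERLTLKHD*
```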

4 What is genomics? TTCCTTAGACTCTTAGAAAGTACCTCAAAAACGAAATGCG AACAC........

5 What is genomics? TTCCTTAGACTCTTAGAAAGTACCTCAAAAACGAAATGCG AACAC.......

6 What is genomics? TTCCTTAGACTCTTAGAAAGTACCTCAAAAACGAAATGCG AACAC....... ATGGAGT........ microRNA

7 What is comparative genomics? TTCCCTAG--------CAAGTACCTCA------------------ TTCCCTAG--------CAAGTACCTCA------------------ TTCCCTAG--------CAAGTACCTCA------------------ TTCCTTAGACTCTTAGCAAGTACCTCA------------------ TTCCTTAGACTCTTAGAAAGTACCTCAAAAACGAAATGCG AACACGACTCT---- TTTTAGCAAGTACCTCAAAATATTTAATTAAA-AC ACTCTT- ---TTTTAGCAAGTACCTCAAGAATTACAATTAAATAT TTCCTTAGACTCTTAGAAAGTACCTCAAAAACGAAATGCGAACAC ATGGAGT........ let-7
Grün et al., microRNA target predictions across seven Drosophila species and comparison to mammalian targets, PLoS Computational Biology, June 2005.
Lall et al., A genome wide map of conserved microRNA targets in C. elegans, submitted to Cell, 2005.

8 The Drosophila Genome Project
1911: Genetic mapping in Drosophila (Sturtevant and Morgan).
2000: Drosophila melanogaster genome sequenced; Celera and LBNL publish the Drosophila genome in Science.
2003: Proposal for Drosophila as a model system for comparative genomics (Clark, Gibson, Kaufman, McAllister, Myers, O’Grady).
2005: Twelve Drosophila genomes sequenced by a consortium involving Agencourt, the Broad Institute, Baylor College of Medicine, Washington University in St. Louis and the Venter Institute.

9

10 Drosophila Projects
Transposable Element Annotation: A. Caspi and L. Pachter, Identification of transposable elements using multiple alignments of related genomes, Genome Research, in press.
Multiple Sequence Alignment: C. Dewey and L. Pachter, Whole Genome Mapping, in preparation; A.S. Schwartz, E.W. Myers and L. Pachter, Alignment metric accuracy, submitted; N. Bray and L. Pachter, MAVID: Constrained ancestral alignment of multiple sequences, Genome Research 14 (2004), pp. 693-699.
Gene Finding: S. Chatterji and L. Pachter, Multiple organism gene finding by collapsed Gibbs sampling, Journal of Computational Biology 12 (2005), pp. 599-608; S. Chatterji and L. Pachter, GeneMapper: Evidence based multiple organism gene finding, in preparation.

11 Drosophila Projects
Transposable Element Annotation: A. Caspi and L. Pachter, Identification of transposable elements using multiple alignments of related genomes, Genome Research, in press.
Multiple Sequence Alignment: C. Dewey and L. Pachter, Whole Genome Mapping, in preparation; A.S. Schwartz, E.W. Myers and L. Pachter, Alignment metric accuracy, submitted; N. Bray and L. Pachter, MAVID: Constrained ancestral alignment of multiple sequences, Genome Research 14 (2004), pp. 693-699; Parametric alignment.
Gene Finding: S. Chatterji and L. Pachter, Multiple organism gene finding by collapsed Gibbs sampling, Journal of Computational Biology 12 (2005), pp. 599-608; S. Chatterji and L. Pachter, GeneMapper: Evidence based multiple organism gene finding, in preparation.

12 Available Drosophila whole genome multiple alignments
MAVID: http://hanuman.math.berkeley.edu/kbrowser
MULTIZ: http://genome.ucsc.edu/ (currently no D. erecta)

13 DroAna_20041206_ GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC DroMel_4_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC DroMoj_20041206_ GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC DroPse_1_ GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC DroSim_20040829_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC DroVir_20041029_ GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC DroYak_1_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC ****** * ****** ** ** ** ***** **** ** ** ** ** ****** * ** Alignment of an exon DroAna_20041206_ CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT DroMoj_20041206_ CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA------- DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG---- DroSim_20040829_ CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-DroVir_20041029_ CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA------- DroYak_1_ CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT *** * * * DroAna_20041206_ AATC-----ACTTAC DroMel_4_ ATTCTATGGACTCAC DroMoj_20041206_ ----TATTTACTCAC DroPse_1_ ------TGTACTTAC DroSim_20040829_ ATTCTATGGACTCAC DroVir_20041029_ ----TATTTACTCAC DroYak_1_ ATTTCATAAACTCAC *** ** Alignment of an intron

14 DroAna_20041206_ GTCGCTCAACCAGCATTTGCAAAAGTCGCAGAACTTGCGCTCATTGGATTTCCAGTACTC DroMel_4_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC DroMoj_20041206_ GTCGCTTAACCAGCATTTACAGAAATCGCAATACTTGCGTTCATTGGATTTCCAGTACTC DroPse_1_ GTCGCTCAGCCAGCACTTGCAGAAGTCGCAGTACTTGCGCTCGTTTGATTTCCAGAATTC DroSim_20040829_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTGCGCTCGTTTGATTTCCAGTACTC DroVir_20041029_ GTCGCTCAACCAGCATTTGCAGAAGTCGCAATACTTGCGTTCATTCGACTTCCAGTACTC DroYak_1_ GTCGCTCAGCCAGCATTTGCAGAAGTCGCAGAACTTCCGCTCGTTTGACTTCCAGTACTC ****** * ****** ** ** ** ***** **** ** ** ** ** ****** * ** Alignment of an exon Alignment of an intron droAna1.2448876 CTGAAGGAATTCTA--TATTAAAG----------------------------------- dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAA----AGCGAGT-TTATTC droMoj1.contig_2959 CTGGAATAGTTAATTTCATTGTAA---------CACATAAA------CGTTTTAAATTC dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGG----AGAGGCCATCATCG droSim1.chr2L CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAA----AGCGGG--TTATTC droVir1.scaffold_6 CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA--------TTCTCTAATTT droYak1.chr2L CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAAT----AGATCCT-TTATTT *** * * * * droAna1.2448876 AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC dm2.chr2L -----------------------------------------TATGGACTCAC droMoj1.contig_2959 -------------------------AAATATTT--------TATTGACTCAC dp3.chr4_group3 -----------------------------------------TGT--ACTTAC droSim1.chr2L -----------------------------------------TATGGACTCAC droVir1.scaffold_6 ---------------------------------AAATATTTGGTCCACTCAC droYak1.chr2L -----------------------------------------CATAAACTCAC *** **

15 How is an alignment made from the sequences?
>dm2.chr2L
CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGACTCAC
>dp3.chr4_group3
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTACTTAC
dm2.chr2L        CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
dp3.chr4_group3  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
dm2.chr2L        TATGGACTCAC
dp3.chr4_group3  TGT--ACTTAC
Given two sequences of lengths n and m (here n=50, m=62), note that the length of an alignment is at least max(n,m) and at most n+m.

16 Each alignment can be summarized by counting the number of matches (#M), mismatches (#X), gaps (#G), and spaces (#S).
dm2.chr2L        CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
dp3.chr4_group3  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
dm2.chr2L        TATGGACTCAC
dp3.chr4_group3  TGT--ACTTAC
DroMel_4_  CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
DroPse_1_  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----
DroMel_4_  ATTCTATGGACTCAC
DroPse_1_  ------TGTACTTAC

17 Each alignment can be summarized by counting the number of matches (#M), mismatches (#X), gaps (#G), and spaces (#S).
dm2.chr2L        CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGT-TTATTC
dp3.chr4_group3  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG
dm2.chr2L        TATGGACTCAC
dp3.chr4_group3  TGT--ACTTAC
#M=31, #X=22, #G=3, #S=12
DroMel_4_  CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT
DroPse_1_  CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG----
DroMel_4_  ATTCTATGGACTCAC
DroPse_1_  ------TGTACTTAC
#M=27, #X=18, #G=3, #S=28
Every matched or mismatched column uses one character from each sequence and every space uses one, so 2(#M+#X)+#S = n+m = 112. Hence #X, #G and #S suffice to specify a summary. This notation follows Chapter 7 (Parametric Sequence Alignment) by Colin Dewey and Kevin Woods in the book Algebraic Statistics for Computational Biology.
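The counting described on this slide can be sketched as follows. This is an illustrative helper, not code from the talk: the function name `summarize` is an assumption, and a gap is taken to be a maximal run of spaces within one row. The example rows are the first of the eight mel/pse alignments shown later in the deck for the summary #X=24, #S=10, #G=2.

```python
def summarize(row_a, row_b, gap="-"):
    """Return (#M, #X, #G, #S) for a pairwise alignment given as two gapped rows."""
    assert len(row_a) == len(row_b), "alignment rows must have equal length"
    matches = mismatches = gaps = spaces = 0
    prev = None  # row holding a space in the previous column: "a", "b" or None
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            cur = "a" if x == gap else "b"
            spaces += 1
            if cur != prev:  # a new run of spaces starts in this column
                gaps += 1
            prev = cur
        else:
            prev = None
            if x == y:
                matches += 1
            else:
                mismatches += 1
    return matches, mismatches, gaps, spaces


mel = "CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC"
pse = "CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC"
M, X, G, S = summarize(mel, pse)
print(M, X, G, S)       # 27 24 2 10
print(2 * (M + X) + S)  # 112
```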

18 The summary of an alignment is a point in 3-dimensional space. For example, the two alignments just shown correspond to the points (#X, #G, #S) = (22,3,12) and (18,3,28). In the example of our two sequences there are 379522884096444556699773447791552717765633 different alignments, but only 53890 different summaries. So we don’t need to plot that many points. But 53890 is still quite a large number. Fortunately, there are only 69 vertices on the convex hull of the 53890 points. That is something we can draw…

19 For the sequences:
>mel
CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGAC
>pse
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTAC
the alignment polytope is: [figure not preserved in the transcript]
Vertex 49: #X=24, #S=10, #G=2. There are eight alignments that have this summary.

20 The eight alignments:
mel CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC
mel CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC
mel CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC
mel CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC
mel CTGCGGGATTAGGGGTCATTAGA---------GTGCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC
mel CTGCGGGATTAGGGGTCATTAGA---------GTGCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC
mel CTGCGGGATTAGGGGTCATTAG---------AGTGCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATC-GTGTAC
mel CTGCGGGATTAGGGGTCATTAG---------AGTGCCGAAAAGCGAGTTTATTCTATGGAC
pse CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG-TGTAC

21 Consensus at a vertex
mel CTGCGGGATTAGGGGTCATTAGAGT===------===GCCGAAAAGCGAGTTTATTCTA=TGGAC
pse CTGGAAGAGTTTTGATTAGTAG===GGGATCCATGGGGGCGAGGAGAGGCCATCATC==GTGTAC

22 The vertices of the polytope have special significance. Given parameters for a model, e.g. the default parameters for MULTIZ: M = 100, X = -100, S = -30, G = -400, the optimal summary is the result of maximizing the linear form -200*(#X)-400*(#G)-80*(#S) over the polytope. Thus, the vertices of the polytope correspond to optimal alignments. For these parameters the maximum is attained at vertex 49: #X=24, #S=10, #G=2.
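The coefficients -200, -400 and -80 can be recovered by eliminating #M from the score with the identity 2(#M+#X)+#S = n+m. The sketch below checks this numerically; the function names are illustrative and not from the talk, and the summary (#M, #X, #G, #S) = (27, 24, 2, 10) with n+m = 112 is the mel/pse example used elsewhere in the deck.

```python
def score(M, X, G, S, params):
    """Alignment score as a linear function of the four summary counts."""
    return (params["M"] * M + params["X"] * X
            + params["G"] * G + params["S"] * S)


def reduced_form(X, G, S, params, total_length):
    """Same score after substituting #M = (n + m - #S) / 2 - #X."""
    m_w, x_w, g_w, s_w = (params[k] for k in "MXGS")
    constant = m_w * total_length / 2          # does not depend on the alignment
    return constant + (x_w - m_w) * X + g_w * G + (s_w - m_w / 2) * S


multiz = {"M": 100, "X": -100, "S": -30, "G": -400}
# Coefficients of #X, #G, #S in the reduced form: -200, -400, -80.
print(multiz["X"] - multiz["M"], multiz["G"], multiz["S"] - multiz["M"] / 2)
print(score(27, 24, 2, 10, multiz))          # -800
print(reduced_form(24, 2, 10, multiz, 112))  # -800.0
```

Because the constant term is the same for every alignment of the two sequences, maximizing the full score and maximizing the reduced linear form pick out the same summaries.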

23 Needleman-Wunsch Alignment
What is usually done is that a single set of parameters is specified (M = 100, X = -100, S = -30, G = -400 is a standard default) and then the optimal vertex is identified using dynamic programming. An alignment optimal for the vertex is then selected. The running time of the algorithm is O(nm) [Needleman-Wunsch, 1970; Smith-Waterman, 1981] and it requires O(n+m) space [Hirschberg, 1975].
Standard scoring schemes are:
Parameters             Model
M, X, S                Jukes-Cantor with linear gap penalty
M, X, S, G             Jukes-Cantor with affine gap penalty
M, X_TS, X_TV, S, G    Kimura 2-parameter with affine gap penalty
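As a rough illustration of the dynamic program, here is a minimal sketch of global alignment scoring for the M, X, S, G scheme, assuming a gap of k spaces costs G + k*S. It uses the standard three-state recurrence for affine gaps, stores full O(nm) tables for readability (it does not implement Hirschberg's linear-space refinement), and returns only the optimal score. The function name is hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, M=100, X=-100, S=-30, G=-400):
    """Optimal global alignment score; each gap costs G plus S per space."""
    n, m = len(a), len(b)
    # diag: last column pairs a[i-1] with b[j-1]
    # up:   last column pairs a[i-1] with a space
    # left: last column pairs a space with b[j-1]
    diag = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    up = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    left = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    diag[0][0] = 0  # the empty alignment
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                sub = M if a[i - 1] == b[j - 1] else X
                diag[i][j] = sub + max(diag[i - 1][j - 1],
                                       up[i - 1][j - 1],
                                       left[i - 1][j - 1])
            if i > 0:
                extend = up[i - 1][j] + S
                opened = max(diag[i - 1][j], left[i - 1][j]) + G + S
                up[i][j] = max(extend, opened)
            if j > 0:
                extend = left[i][j - 1] + S
                opened = max(diag[i][j - 1], up[i][j - 1]) + G + S
                left[i][j] = max(extend, opened)
    return max(diag[n][m], up[n][m], left[n][m])


print(affine_global_score("ACGT", "ACGT"))  # 400
print(affine_global_score("AC", "AG"))      # 0
print(affine_global_score("AAAA", "AA"))    # -260
```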

24 Available Drosophila whole genome multiple alignments
MAVID: http://hanuman.math.berkeley.edu/kbrowser
MULTIZ: http://genome.ucsc.edu/ (currently no D. erecta)

25 DroAna_20041206_ CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT DroMoj_20041206_ CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA------- DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG---- DroSim_20040829_ CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-DroVir_20041029_ CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA------- DroYak_1_ CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT *** * * * DroAna_20041206_ AATC-----ACTTAC DroMel_4_ ATTCTATGGACTCAC DroMoj_20041206_ ----TATTTACTCAC DroPse_1_ ------TGTACTTAC DroSim_20040829_ ATTCTATGGACTCAC DroVir_20041029_ ----TATTTACTCAC DroYak_1_ ATTTCATAAACTCAC *** ** N. Bray and L. Pachter, MAVID: Constrained ancestral alignment of multiple sequences, Genome Research 14 (2004) p 693--699 MAVID

26 Needleman-Wunsch

27 droAna1.2448876 CTGAAGGAATTCTA--TATTAAAG----------------------------------- dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAA----AGCGAGT-TTATTC droMoj1.contig_2959 CTGGAATAGTTAATTTCATTGTAA---------CACATAAA------CGTTTTAAATTC dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGG----AGAGGCCATCATCG droSim1.chr2L CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAA----AGCGGG--TTATTC droVir1.scaffold_6 CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA--------TTCTCTAATTT droYak1.chr2L CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAAT----AGATCCT-TTATTT *** * * * * droAna1.2448876 AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC dm2.chr2L -----------------------------------------TATGGACTCAC droMoj1.contig_2959 -------------------------AAATATTT--------TATTGACTCAC dp3.chr4_group3 -----------------------------------------TGT--ACTTAC droSim1.chr2L -----------------------------------------TATGGACTCAC droVir1.scaffold_6 ---------------------------------AAATATTTGGTCCACTCAC droYak1.chr2L -----------------------------------------CATAAACTCAC *** ** Blanchette et al., Aligning multiple sequences with the threaded blockset aligner, Genome Research 14 (2004) p 708--715 MULTIZ

28 Needleman-Wunsch

29 droAna1.2448876 CTGAAGGAATTCTA--TATTAAAG----------------------------------- dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAA----AGCGAGT-TTATTC droMoj1.contig_2959 CTGGAATAGTTAATTTCATTGTAA---------CACATAAA------CGTTTTAAATTC dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGG----AGAGGCCATCATCG droSim1.chr2L CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAA----AGCGGG--TTATTC droVir1.scaffold_6 CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA--------TTCTCTAATTT droYak1.chr2L CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAAT----AGATCCT-TTATTT *** * * * * droAna1.2448876 -----ACTTAC dm2.chr2L TATGGACTCAC droMoj1.contig_2959 TATTGACTCAC dp3.chr4_group3 TGT--ACTTAC droSim1.chr2L TATGGACTCAC droVir1.scaffold_6 GGTCCACTCAC droYak1.chr2L CATAAACTCAC *** **

30 DroAna_20041206_ CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT DroMoj_20041206_ CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA------- DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG---- DroSim_20040829_ CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-DroVir_20041029_ CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA------- DroYak_1_ CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT *** * * * DroAna_20041206_ AATC-----ACTTAC DroMel_4_ ATTCTATGGACTCAC DroMoj_20041206_ ----TATTTACTCAC DroPse_1_ ------TGTACTTAC DroSim_20040829_ ATTCTATGGACTCAC DroVir_20041029_ ----TATTTACTCAC DroYak_1_ ATTTCATAAACTCAC *** **

31 droAna1.2448876 CTGAAGGAATTCTA--TATTAAAG----------------------------------- dm2.chr2L CTGCGGGATTAGGGGTCATTAGAG---------TGCCGAAA----AGCGAGT-TTATTC droMoj1.contig_2959 CTGGAATAGTTAATTTCATTGTAA---------CACATAAA------CGTTTTAAATTC dp3.chr4_group3 CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGG----AGAGGCCATCATCG droSim1.chr2L CTGCGGGATTAGGAGTCATTAGAG---------TGCGGAAA----AGCGGG--TTATTC droVir1.scaffold_6 CTGCAGCAGTTAA-ATAATTGTAA---------TAAACAA--------TTCTCTAATTT droYak1.chr2L CTGCGGGATTAGCGGTCATTGGTG---------TGAAGAAT----AGATCCT-TTATTT *** * * * * droAna1.2448876 AAGATTTCTCATCATTGGTTGAATC---------------------ACTTAC dm2.chr2L -----------------------------------------TATGGACTCAC droMoj1.contig_2959 -------------------------AAATATTT--------TATTGACTCAC dp3.chr4_group3 -----------------------------------------TGT--ACTTAC droSim1.chr2L -----------------------------------------TATGGACTCAC droVir1.scaffold_6 ---------------------------------AAATATTTGGTCCACTCAC droYak1.chr2L -----------------------------------------CATAAACTCAC *** **

32 DroAna_20041206_ CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG DroMel_4_ CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT DroMoj_20041206_ CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA------- DroPse_1_ CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG---- DroSim_20040829_ CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT-DroVir_20041029_ CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA------- DroYak_1_ CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT *** * * * DroAna_20041206_ AATC-----ACTTAC DroMel_4_ ATTCTATGGACTCAC DroMoj_20041206_ ----TATTTACTCAC DroPse_1_ ------TGTACTTAC DroSim_20040829_ ATTCTATGGACTCAC DroVir_20041029_ ----TATTTACTCAC DroYak_1_ ATTTCATAAACTCAC *** **

33 One (possibly wrong) alignment is not enough: the history of parametric inference
1992: Waterman, M., Eggert, M. & Lander, E. Parametric sequence comparisons, Proc. Natl. Acad. Sci. USA 89, 6090-6093.
1994: Gusfield, D., Balasubramanian, K. & Naor, D. Parametric optimization of sequence alignment, Algorithmica 12, 312-326.
2003: Wang, L. & Zhao, J. Parametric alignment of ordered trees, Bioinformatics 19, 2237-2245.
2004: Fernández-Baca, D., Seppäläinen, T. & Slutzki, G. Parametric multiple sequence alignment and phylogeny construction, Journal of Discrete Algorithms 2, 271-287.
Software: XPARAL by Kristian Stevens and Dan Gusfield.

34 Whole Genome Parametric Alignment
Colin Dewey, Peter Huggins, Lior Pachter, Bernd Sturmfels and Kevin Woods
Mathematics and Computer Science: Parametric alignment in higher dimensions. Faster new algorithms. Deeper understanding of alignment polytopes.
Biology: Whole genome parametric alignment. Biological implications of alignment parameters. Alignment with biology rather than for biology.

35 Whole Genome Parametric Alignment Colin Dewey, Peter Huggins, Lior Pachter, Bernd Sturmfels and Kevin Woods Mathematics and Computer Science Parametric alignment in higher dimensions. Faster new algorithms. Deeper understanding of alignment polytopes. Biology Whole genome parametric alignment. Biological implications of alignment parameters. CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA------- CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG---- CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT- CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA------- CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT analysis

36 Whole Genome Parametric Alignment Colin Dewey, Peter Huggins, Lior Pachter, Bernd Sturmfels and Kevin Woods Mathematics and Computer Science Parametric alignment in higher dimensions. Faster new algorithms. Deeper understanding of alignment polytopes. Biology Whole genome parametric alignment. Biological implications of alignment parameters. CTGAAGGAAT-------TCTATATT---------AAAGAAGATTTCTCATCATTGGTTG CTGCGGGATTAGGGGTCATTAGAGT---------GCCGAAAAGCGA---------GTTT CTGGAATAGTTAATTTCATTGTAACACATAAACGTTTTAAATTCTATTGAAA------- CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCG---- CTGCGGGATTAGGAGTCATTAGAGT---------GCGGAAAAGCGG---------GTT- CTGCAGCAGTTAAATA-ATTGTAATAAACAATTCTCT--AATTTGGTCCAAA------- CTGCGGGATTAGCGGTCATTGGTGT---------GAAGAATAGATC---------CTTT analysis

37 computational geometry

38 Finding the polytope is called parametric inference. This polytope took 3 seconds to compute using the beneath-beyond method [Grünbaum, Convex Polytopes, 1967].
Vertex  (#X,#S,#G)   [#alignments]
40      (15,16,16)   [1080]
41      (17,30,2)    [4]
42      (18,14,5)    [4]
43      (18,16,4)    [56]
44      (20,10,6)    [16]
45      (20,10,7)    [24]
46      (23,8,6)     [6]
47      (23,8,8)     [165]
48      (24,8,3)     [38]
49      (24,10,2)    [8]
50      (25,8,2)     [24]
51      (25,62,3)    [2]
52      (28,48,2)    [1]
53      (29,8,1)     [6]
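Once vertices are known, choosing parameters amounts to maximizing a linear form over them. The sketch below does this for the fourteen vertices listed on this slide only (the full polytope has more vertices, so this is not a complete search). The weights -200, -80, -400 are the reduced MULTIZ defaults quoted earlier in the talk; the function name is illustrative.

```python
# Vertex id -> (#X, #S, #G), copied from the slide (vertices 40 to 53 only).
VERTICES = {
    40: (15, 16, 16), 41: (17, 30, 2), 42: (18, 14, 5), 43: (18, 16, 4),
    44: (20, 10, 6), 45: (20, 10, 7), 46: (23, 8, 6), 47: (23, 8, 8),
    48: (24, 8, 3), 49: (24, 10, 2), 50: (25, 8, 2), 51: (25, 62, 3),
    52: (28, 48, 2), 53: (29, 8, 1),
}


def best_vertex(vertices, x_weight, s_weight, g_weight):
    """Return the id of the listed vertex maximizing the linear form."""
    def value(vertex_id):
        x, s, g = vertices[vertex_id]
        return x_weight * x + s_weight * s + g_weight * g
    return max(vertices, key=value)


print(best_vertex(VERTICES, -200, -80, -400))  # 49
print(best_vertex(VERTICES, -1, 0, 0))         # 40 (fewest mismatches)
```

Among the listed vertices, the default parameters select vertex 49, which agrees with the summary #X=24, #S=10, #G=2 highlighted in the talk.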

39

40

41

42

43

44 How do we build the polytope for these sequences?
>mel
CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGAC
>pse
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTAC
Associated to every pair of sequences is a polynomial built from the “summaries” of the alignments. For example, vertex 49 (#X=24, #S=10, #G=2) corresponds to the monomial $8X^{24}S^{10}G^{2}$.
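To make the polynomial concrete, here is a small sketch that builds it by dynamic programming for a simplified two-variable model: only mismatches (X) and spaces (S) are tracked, and the gap variable G from the slide is left out to keep the state small. Each key (x, s) stands for the monomial X^x S^s and its value is the coefficient, i.e. the number of alignments with that summary. The function name is an assumption.

```python
from collections import Counter


def alignment_polynomial(a, b):
    """Map each summary (#X, #S) to the number of alignments of a and b having it."""
    n, m = len(a), len(b)
    poly = [[Counter() for _ in range(m + 1)] for _ in range(n + 1)]
    poly[0][0][(0, 0)] = 1  # the empty alignment
    for i in range(n + 1):
        for j in range(m + 1):
            cell = poly[i][j]
            if i > 0:  # a[i-1] against a space: multiply by S
                for (x, s), count in poly[i - 1][j].items():
                    cell[(x, s + 1)] += count
            if j > 0:  # a space against b[j-1]: multiply by S
                for (x, s), count in poly[i][j - 1].items():
                    cell[(x, s + 1)] += count
            if i > 0 and j > 0:  # match (factor 1) or mismatch (factor X)
                mis = 0 if a[i - 1] == b[j - 1] else 1
                for (x, s), count in poly[i - 1][j - 1].items():
                    cell[(x + mis, s)] += count
    return poly[n][m]


# For "A" vs "A" the polynomial is 1 + 2*S^2.
print(sorted(alignment_polynomial("A", "A").items()))  # [((0, 0), 1), ((0, 2), 2)]
print(sum(alignment_polynomial("AC", "AG").values()))  # 13 alignments in total
```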

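45 Polytope propagation
$NP_{i,j} = S * NP_{i-1,j} + S * NP_{i,j-1} + (X \text{ or } M) * NP_{i-1,j-1}$
Here $NP_{i,j}$ is the Newton polytope for positions [1,i] and [1,j] in each sequence, + is the convex hull of the union, and * is the Minkowski sum.
[Figure: dynamic programming grid for the sequences AACATTAGA and AGATTACCACA.]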
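The recurrence can be sketched in two dimensions, tracking only (#X, #S). This is a simplification of the slide's model (no gap-open parameter, matches contribute nothing), and the function names are illustrative. Multiplying by a single monomial is a translation (a Minkowski sum with a point), and addition is the convex hull of the union, computed here with Andrew's monotone chain.

```python
def convex_hull(points):
    """Vertices of the convex hull of 2D integer points (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, p, q):
        return (p[0] - o[0]) * (q[1] - o[1]) - (p[1] - o[1]) * (q[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()  # drop points that are not strict hull vertices
            chain.append(p)
        return chain

    lower, upper = half(pts), half(reversed(pts))
    return lower[:-1] + upper[:-1]


def alignment_polygon(a, b):
    """Vertices (#X, #S) of the 2D alignment polytope of sequences a and b."""
    n, m = len(a), len(b)
    NP = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                NP[i][j] = [(0, 0)]
                continue
            candidates = []
            if i > 0:  # S * NP[i-1][j]: shift by one space
                candidates += [(x, s + 1) for x, s in NP[i - 1][j]]
            if j > 0:  # S * NP[i][j-1]: shift by one space
                candidates += [(x, s + 1) for x, s in NP[i][j - 1]]
            if i > 0 and j > 0:  # (X or M) * NP[i-1][j-1]
                mis = 0 if a[i - 1] == b[j - 1] else 1
                candidates += [(x + mis, s) for x, s in NP[i - 1][j - 1]]
            NP[i][j] = convex_hull(candidates)  # "+" is hull of the union
    return NP[n][m]


print(sorted(alignment_polygon("A", "C")))  # [(0, 2), (1, 0)]
print(sorted(alignment_polygon("AACATTAGA", "AGATTACCACA")))
```

Only hull vertices are stored in each cell, which is what keeps the propagation cheap compared with enumerating every summary.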

46 Complexity of polytope propagation
Theorem: The number of vertices of an alignment polytope for two sequences of lengths n and m is $O((n+m)^{d(d-1)/(d+1)})$, where d is the number of free parameters in the scoring scheme.
Examples:
Parameters             Model                                     Vertices
M, X, S                Jukes-Cantor with linear gap penalty      $O((n+m)^{2/3})$
M, X, S, G             Jukes-Cantor with affine gap penalty      $O((n+m)^{3/2})$
M, X_TS, X_TV, S, G    K2P with affine gap penalty               $O((n+m)^{12/5})$
L. Pachter and B. Sturmfels, Parametric inference for biological sequence analysis, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), pp. 16138-16143.
L. Pachter and B. Sturmfels, Tropical geometry of statistical models, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), pp. 16132-16137.
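The three exponents in the examples follow from the formula d(d-1)/(d+1). The short check below assumes d = 2, 3 and 4 for the three scoring schemes; those values are inferred from the exponents on the slide rather than stated there, and the function name is illustrative.

```python
from fractions import Fraction


def vertex_exponent(d):
    """Exponent d(d-1)/(d+1) in the vertex bound O((n+m)^(d(d-1)/(d+1)))."""
    return Fraction(d * (d - 1), d + 1)


# d values are assumptions chosen to match the exponents listed on the slide.
for d, scheme in [(2, "M, X, S"), (3, "M, X, S, G"), (4, "M, X_TS, X_TV, S, G")]:
    print(scheme, "->", vertex_exponent(d))
# M, X, S -> 2/3
# M, X, S, G -> 3/2
# M, X_TS, X_TV, S, G -> 12/5
```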

47 Inference functions
Definition: Given two integers n and m and a scoring scheme for sequence alignment, an inference function assigns to every pair of sequences of lengths n and m, respectively, an (optimal) alignment.
Remark: The number of inference functions could, in principle, be doubly exponential in n+m. This is because the number of alignments is the Delannoy number D(n,m), which is exponential in n+m, and the number of sequence pairs is $4^{n+m}$.
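The Delannoy numbers satisfy D(n,m) = D(n-1,m) + D(n,m-1) + D(n-1,m-1) with D(n,0) = D(0,m) = 1, mirroring the three ways an alignment can end. A minimal sketch for computing them, with an illustrative function name:

```python
def delannoy(n, m):
    """Number of pairwise alignments of sequences of lengths n and m."""
    row = [1] * (m + 1)  # D(0, j) = 1 for all j
    for _ in range(n):
        new_row = [1]    # D(i, 0) = 1
        for j in range(1, m + 1):
            # last column is: space in b, space in a, or a paired column
            new_row.append(row[j] + new_row[j - 1] + row[j - 1])
        row = new_row
    return row[m]


print(delannoy(2, 2))  # 13
print(delannoy(3, 3))  # 63
print(4 ** (3 + 3))    # 4096 sequence pairs of lengths 3 and 3
```

Python integers are arbitrary precision, so the same function can be called with the sequence lengths from the talk without overflow.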

48 Few inference functions theorem
Theorem (S. Elizalde 2005): The number of inference functions for the two-parameter alignment model with two sequences of length n is $\Theta(n^2)$.
Proof (outline):
1. The number of inference functions is the number of vertices of the Minkowski sum of the Newton polytopes of the observations.
2. The Newton polytopes are all lattice polytopes, and therefore have few non-parallel edges.
3. The number of vertices of the Minkowski sum is at most [bound not preserved in the transcript], where m is the number of non-parallel edges and d is the dimension of the polytopes.

49 Algebraic Statistics: a language for unifying and developing many of the algorithms for biological sequence analysis. The few inference functions theorem; polytope propagation; phylogenetic tree reconstruction; evolutionary models; maximum likelihood estimation; mutagenic tree models.

50 ATCCAGAAGTCTAGTATACATCTCAAAATTCATGCATCTGGCCGGGCACAGTGGCTCACACCTGCAATCCCAGCACTTTGGGAGGCCGAGGTGGGTGGATTACCTGAGGTCAGGAGTTTA AGACCAGCCTGGCCAACATGGTAAAACCCCATCTCTACTAAAAATACAAGTATTAGCCAGGCATTGTGGCAGGTGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAAAATCACTT GAACCGGGAGGCGGAGGTTGGAGTGAGCTGAGATCGTGCTACCGCACTCCATGCACTCTAGCCTGGGCAACAGAACGAGATGCTGTCACAACAACAACAACAACAACAACAACAACAAC AACAACAACAACAAATTCTCACATCTAAAACAGAGTTCCTGGTTCCATTCCTGCTTCCTGCCTTTCCCACTCCCCCATATTCCCTACCATGCCTTCTTCATCTAATTTAATATTACTAACAAGA TCTATTGTTCAAGCCAAAACCCAAGTGTCACTCCTTCAATTTCTCTTTACCTTATCCTCCAAATTTAATCCATTAGCAAGTCCTCTCTTCAAACCCATCCCAAACCAACCTTGTTTTTAACCAT CTCCACACCACCAATTACCACAAGGATAAAATCTGAATTCCTTACCACCAAATACTATGTGATCTGGCCCTCATCTATGACCTTCTCCCATTCCTTGTGTAATCTCTGCCTCCACACATAATT TGCAAATTACTCCAGCTACACTGGCCTATTATTATTATTATTATTATTTTTGAGACGGAGTCTTGCTCTTTCGCCCAGCCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAATCTCCGCC TCCTGGGTTCAAGCGATTCTCCTGCCCCAGCCTCCCAAGTAGCTGTGATTACAGGCACATGCCACCATTCCCAGCTAATTTTTTTTTGTTTTTGAGATGGAGTTTCACTCTTGTTGCCCAGG CTGGAGTGCAATGGTGCGATCTCAGCTCACCACAACCTCCACCTCCCGGGTTGATGAAGTGATTCTCTTGTCTCAGCCTCCCGTGTAGCTGGGATTAGAGGCACGCGCCACCACGCTGG GCAAATTTTTGTATTTTTAGTAGAGACAGGGTTTCTACCTCAGTGATCTGTCCGCCTTGACCTCCCAAAGTGCTGGGATTACAGGAATGAGCCACCACACCCAGCCGTGCCCAGCTAATTT TTGCATTTTTTAGTAGAGATGGGGTTTTGCCACGTTGGCCAGGCTGGTCTCAAACTCCTGACCTCAGGGGATCTGCCTGCCTCGGCCTCCTAGAGTGCTGGAATTACAGGTGTGAGCCAC TGTGCCCGAACCTTTTATCATTATTATTTCTTGAGACAGGAGTCTTGCTCTGTCGTTCAGGCTGGAGTGCAGTGATGCGATCTTGGCTCACTGTAACTCCTACCTTTCGGTTCAAGTGATTC TCCTGCCTCAGCCTCTGGAGTAGCTGGGATTACAGGCACTGGGATTACAGGCACACACCACCACACCATGCTAGTTTTTTGTATTTTTAGTAGAGATGGGGTTTCACCATGTTGGCCAGG CTGGTCTCGAACTCCTGACCTCAAGTGATTTGCCTGCCTTGGCTTCCCAAAGTGCTGGGATTATAGGCACGAGCCACCACACACGACCAACATTGGCCTATCTTTTAAAAAATAAACCAAG CTCTGGCCGGGCACAGTGGCTCACACCTGTGATCCCAGCACTTTGGGAGGTTGAGGTGGTTGGATCACTTGAGTTCAGGAGTTTGAGACCAGCCTGACCAACGTGGTAAAACCCCATCT CTACTAAAAATAAAAACTAGTCGGGTGTGGTAGCACGCGTGCCTGTAATACCAGCTACTCAGGAGGCCAAGGCAGGAGAATTGCTTGAACCCAGGAGACAGAGTTTGCAGTGAGCCAAG 
ATTGTGCCACTGCACTCCAGCCTGGGGGATAGAGGGAGACACCATCTCAAAAAAACCAAAATACAGAAATCAAAAAACCACACTCATTATTACCTCAAGACCTTTATGTTTGCTATTCCTCT GCCTATAAGATGCATTCCCTTCATTTTTCAAGGACAATTATTTCTTGTTATTTAGGTCTCAGCTCAATTTTTTCAGAAAGGCTTTCCCTGGCCTCCTTAAACGAAAGTAATCAACAACCTTTGA CAGCTAATACTATTCCACTGTTCTGTATATTTCTCCATAGCATTTATTGTTATCTTAAATTCATCTTTATTGTGTATCTCCCCTCGACAGAACCTGAATCCTACCAGGGACTTAGTTAGTCTTAT TTACTGTTGCATTCCTAGTGCCCAGAACACAGTAGGCTCCCAATAAATAGCCACTGAATAAAAGTTAAAACCAACAAAAATAATCATTTAATTAATTATGAATACATCGAATTGTGCACAATA GTTTATAAAATTACTTTTTTTTTTTTTTTAAGACAGGGTCTCATTCTGTCTCACAGGCTGGAGTGCAGTGGTGCAATCTAGGCTCACTGCAACCTCCGCCTCCCGGGTTCAAGTGATTCTCC TGCCTCAGCCTCCCCAGCAGCTAGGATTACAGGCACATGCCACCACGCTCGACTAATTTTTTTGTGTTTTTAGTAGAGACAAGGTTTCACCATGTTGACCAGGCTGGTCTCGAACTCCTGA CCTCAAGTGATCCACCTGCCTTGGCCACTCAAAGTGCTGGGATTATAGGCATGAGCCACCACGCCTGGCCTATAAAATTACTTTCACATTTCATTTTGCCTGATCTGTTGTCACAGAAGTTC TCAGATGGCTGTTCTGAAATTATTCCTCCTCCTACACTCTATCTTATTTACTTCTCACTGTTCTCAGTATCATAAAGTGCAACATCTTTTTGAAGCAATCTGAATTATAAACAGATACATTTGCA TGTATATATATGTATATATGCATATGCACACACACACTTTTTTTTTTTTAAGAGACAGGGTCTTGCTCTGTGCAAGTGCAAGAGTGCAATGGTATGATCATAGCTCACTGCAGCCTTGAACTC CTGGGCTCAAGTGATTCTTCTGGCTTAGCTTCCTCAGTAGCTAAGACTACAGAAGCACACTGCCATGCCCGGCTAATTAAAAAAAAATTTTGTGGAGACAGAGTCTCACTATGTTGCCCAG GCTGGTTTCAAACTCCTGGCCTCAAGTAATCTTCCTGTCTCAGCCTCCCAAAGGGCTGAGATTATAAGTGTGAGCCACTGCATCTGGACTGCATATTAATATGAAGAGCTTTTCTTCAACAA CAGTGAACAGTTTTCTACAAAGGTATATGCAAGTGGGCCCACTTCTTGTTCTTATGAATCTTTTCTTTCCTTTTATAAAACTCCTTTTCCTTTCTCTTTTCCCCAAAGAAAGGACTGTTTCTTTT GAAATCTAGAACAAATGAGAACAGAGGATATCCTGGTTTGCGCTGCAAAATTTTTTTTTTTTTTAAGACGGAGTCTCGCTCTGTTGCCAGGTTGGAGTGCAGTGGCACGATCTTGGCTCATT GCAACCTCCACCTCCCGGGTTCAAGAGATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGAACTAAAGGCGCATGCCACCACGCTGAGTAATTTTTTGTATTTTAGTAGAGACAGGGTTTCAC CATGTTGCCCAGGCTGATCTCGAACTCCTGAGCTCAGGCAATCTGCCTGTCTTGGCCTCCCACAGTGTTAGGATTACAGGCATGAGCCACTGCACCCGATTTTTTTTTTCTTTTGATGGAG TTTTGCTCTTGTTGCCCAGGTTAGAGTGCAATGATGCGATCTCAGCTCACTGCAACCCCCGCCTCCCAGGTTCAAGTGATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGAATTACAGGCA 
AGTGCCACCAAGCCCGGCTAATTTTGTATTTTTAGTAGAAACGGGGTTTCTCCATGTTGGTCAGGCTGGTCTTGAACTCCCGACATCAGGTGATCCAAGCGCCTCAGCCTCCCAAAGCGC TGGGATTATAGGTATGAGCCACAGTGCAGGCCTGCATAATTCTTGATGATCCTCATTATCATGGAAAATTTGTGCATTGTTAAGGAAAGTGGTGCATTGATGGAAGGAAGCAAATACATTTT TAACTATATGACTGAATGAATATCTCTGGTTAGTTTGTAACATCAAGTACTTACCTCATTCAGCATTTTTCTTTCTTTAATAGACTGGGTCACCCCTAAAGAGATCATAGAAAAGACAGGTTAC ATACAGCAGAAGAACGTGCTCTTTTCACGGAGATAGAGAGGTCAGCGATTCACAAAAGAGCACAGGAAGAATGACAGAGGAGAGGTCCTTCCCTCTAAAGCCACAGCCCTTTAATAAGGC TTGTAGCAGCAGTTTCCTTCTGGAGACAGAGTTGATGTTTAATTTAAACATTATAAGTTTGCCTGCTGCACATGGATTCCTGCCGACTATTAAATAAATCCCTAGCTCATATGCTAACATTGC TAGGAGCAGATTAGGTCCTATTAGTTATAAAAGAGACCCATTTTCCCAGCATCACCAGCTTATCTGAACAAAGTGATATTAAAGATAAAAGTAGTTTAGTATTACAATTAAAGACCTTTTGGT AACTCAGACTCAGCATCAGCAAAAACCTTAGGTGTTAAACGTTAGGTGTAAAAATGCAATTCTGAGGTGTTAAAGGGAGGAGGGGAGAAATAGTATTATACTTACAGAAATAGCTAACTAC CCATTTTCCTCCCGCAATTCCTAGAAAATATTTCAGTGTCCGTTCACACACAAACTCAGCATCTGCAGAATGAAAAACACTCAAAGGATTAGAAGTTGAAAACAAAATCAGGAAGTGCTGTC CTAAGAAGCTAAAGAGCCTCAGTTTTTTACACTCCCAAGATCAATCTGGATTTATGATTCTAAAACCCCTGGTGACAGAATCAGAGGCTGAAAACACCACTAATTATAACCAGCAGGTATGG ATATTTGGAAGTCTAGGGGAGGCTGATATGAAGTTAAGACCAGAGGAAATATCTGTCCACTCCCTCTTCTCAACACCCATCTTCTAGACGCCAAGGCTAGCTATAGATCTCCATTATAGTGT TCAAGGAATTAGGAATTATCCATGTCAATAGTTTTGATTAATGTGGACGGAGAACATCTATATTACTAGATGGCAATATGTGAAAGAAGAAAACAGTATTGTTGAAAACCTAAATCTGAAATG TCAATGTAATGACAAATTTTCACCCCTAGAATGTCTACCTGGGGAGTCCTAACCCTCTAATATTCCCCTGAGAGGGATGGGAGAATACAGTGCAGAGCTTTTATATAAGTATTTCAGAAAGC AGTAGCTAAAGAATCACTTGTTTATTTCCCAGTGTTTCAAAGGCCCTTCTGAAGAACTAAGCAAACTAAGGAAAGACCATTTAGTTTTAAACAGGAGAAATGTATTTAACTAAATCCTAAACA CAGCAGGCTATCTGCAAGCAGCAGCAGCAGCAGCAGCCATGCTCCCTCACAGAATCCTTACAATTTTTGAAGTTTTTTGTTTAACTGCTACAAAAGCCGATTTAGTAACATTTATTACACTT AAAAACTTCAGTTCATTTGTAGTTCAAAGCAAATGTATTGGCTTTGAGTTTAAAGACTGAACTACTTTAGATTTGATTTGCATTTTTTTTTTTTTTTTTTTTTGAGATGCAGTCTTGCTCTGTCA GCCAGGCTGGAGTGCAGTGGCTGGATCTCAGCTCACGGCAAGCTCTGCCTCCTGGGTTCATGCCATTCTCCTGCCTCAGCCTCCTGAGTAGCTGGGACTACAGATGCCCGCCACCATGC 
CCGGCTAATTTTTTGTATTTTTACTAGAGATGGGGTTTCACCGTGTTAGCCAGGATGGTCTCGATCTCCTGACCTCGTGATCTGCCCGCCTTGGCCCCCCAAAGCGCTGGGATTACAGGC CTGAGCCACCACGCTTGGCATCTTTTTACCTTTCATTAACTTTGATGCAAACCTATAGCTTAAGGTATCTTAAACTTTAATGACATTTTTCTCTAAAATAGTAGTTTGTAATAACTTGTTCTGG CACCTGGCTCCAATGAACACTACCCTCTGACCCTGTGGTATAATTTTCATGAGTAAGTGGAAACCTAAGATCTTAGAAGTTCAACGGCAATGTGTCCAAGGGGTTTAGATCCTCTCCTTAA GTGCCTGTATCTCTGTGAAAAGAATCATCATAGGCTAGGCGCGATGGCTCACACCTGTAATCCCAGCACTTTGGGAGGCCGAGGTAGGTGGATCACCTGAGGTCGGGAGTCCAAGACCA GCCTGACTGACATGGAAAAACCCTGTCTCTACTAAAAATACAAAATTAGGTATGGTGGTGCATTCCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAG GGGGAGGTTGCAGCAAGCCAAGATCGTGCCATTGCACTCCAGCAGCCTGGGCAACAAGAGTGAAAAACTACACCTCAAAAACAAAAACAAAAACAAAAGAATCATCATCAAGTGAACTGG AACACATCCAGAGAACTAATTTTGTTAGAAAGATTTTAGAGTTGAGCCACACAATCTGCATCTTCTGCGTCCTCCATGCACTCGTCTGCTTTCTGGAGCCCCATGAGTGAGTCTTAATCCTG TTCCAGATAACAGTTCTCTTCCGGGTAACGGTTCTTCAGATACTTGAAGACAGTGTCTTATTTCCTTAAATCTTCTCATTTCTTCTTCAAAAGACAGTATTTCAAGTTACTTTTATGTATCTTTA CCATCTACCTCTGGATAAACACTCTCCAATTTGTCAGTGACCATGTTAAAAACCAAGCACGGTGCTTAAAACTGACATCATCTTTCAGGCAATCACTCCATTGGAGAATACAGTGGGGCTCT GGATCTGTACTTCACTTGCTCCAGAGCCTCTGCTTGTGTTAATACGGCCCAGTTTCAAATAAGCATTTTTAGCAGCCCTGAAATGTGTACTCAGATTTAGTTTATAGTCAACTAAAAACACCC AGAGGTCTCCTGTATTACACAAGTTATAATTAAAACCTTAAAAGAGAAAGGTATAGGACAAATGATCTGTCTCCTCCCTTTTTTGCTTTTTCATATGTTAAGACTATCTCGGAGCTGTTATCA GACTT

