Annotation and Alignment of the Drosophila Genomes.

Annotation and Alignment of the Drosophila Genomes Centro de Ciencas Genomicas, May 29, 2006.

Annotation and Alignment of the Drosophila Genomes

One (possibly wrong) alignment is not enough: the history of parametric inference
1992: Waterman, M., Eggert, M. & Lander, E. Parametric sequence comparisons, Proc. Natl. Acad. Sci. USA 89.
Gusfield, D., Balasubramanian, K. & Naor, D. Parametric optimization of sequence alignment, Algorithmica 12.
Wang, L., Zhao, J. Parametric alignment of ordered trees, Bioinformatics.
Fernández-Baca, D., Seppäläinen, T. & Slutzki, G. Parametric Multiple Sequence Alignment and Phylogeny Construction, Journal of Discrete Algorithms.
XPARAL by Kristian Stevens and Dan Gusfield.

Whole Genome Parametric Alignment
Colin Dewey, Peter Huggins, Lior Pachter, Bernd Sturmfels and Kevin Woods
Mathematics and Computer Science
  Parametric alignment in higher dimensions.
  Faster new algorithms.
  Deeper understanding of alignment polytopes.
Biology
  Whole genome parametric alignment.
  Biological implications of alignment parameters.
  Alignment with biology rather than for biology.

[Figure: example sequence fragments and gapped alignments, labeled "analysis" and "computational geometry".]


A Whole Genome Parametric Alignment of D. melanogaster and D. pseudoobscura
Divided the genomes into 1,116,792 constrained and 877,982 unconstrained segment pairs.
  This is an orthology map of the two genomes.
2d, 3d, 4d, and 5d alignment polytopes were constructed for each of the 877,802 unconstrained segment pairs.
  For each segment pair, obtain all possible optimal summaries for all parameters in a Needleman-Wunsch scoring scheme.
Computed the Minkowski sum of the 877,802 2d polytopes.
  There are only 838 optimal alignments of the two Drosophila genomes if the same match, mismatch and gap parameters are used for all the segment pair alignments.
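The Minkowski sum step can be illustrated on a toy scale. The sketch below is not the authors' implementation: it assumes 2d polytopes given as lists of (mismatches, spaces) vertices, and the helper names `convex_hull`, `minkowski_sum` and `optimal_vertex` are hypothetical. It sums two small polygons by brute force and then picks the vertex that is optimal for one choice of mismatch and space costs.

```python
def cross(o, a, b):
    """Signed area of the turn o -> a -> b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Vertices of the convex hull of 2d points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def minkowski_sum(p, q):
    """Minkowski sum of two convex polygons: hull of all pairwise vertex sums.

    Quadratic in the number of vertices; fine for a sketch, not for
    hundreds of thousands of polytopes.
    """
    return convex_hull([(a[0] + b[0], a[1] + b[1]) for a in p for b in q])


def optimal_vertex(polytope, mismatch_cost, space_cost):
    """Vertex minimizing the linear cost for one choice of parameters."""
    return min(polytope, key=lambda v: mismatch_cost * v[0] + space_cost * v[1])


# Hypothetical polytopes for two tiny segment pairs.
pair_1 = [(1, 0), (0, 2)]   # e.g. "A" vs "C": one mismatch, or two spaces
pair_2 = [(0, 0), (0, 2)]   # e.g. "A" vs "A": one match, or two spaces
genome_polytope = minkowski_sum(pair_1, pair_2)
```

Each vertex of the summed polytope corresponds to using the same parameters on every segment pair, which is why the number of vertices counts the distinct optimal whole-genome summaries.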

>mel
CTGCGGGATTAGGGGTCATTAGAGTGCCGA
AAAGCGAGTTTATTCTATGGAC
>pse
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGA
GGAGAGGCCATCATCGTGTAC
How do we build the polytope for these two sequences?

Alignment polytopes are small
Theorem: The number of vertices of an alignment polytope for two sequences of length n and m is $O((n+m)^{d(d-1)/(d+1)})$, where d is the number of free parameters in the scoring scheme.
Examples:
  Parameters             Model                                   Vertices
  M, X, S                Jukes-Cantor with linear gap penalty    $O((n+m)^{2/3})$
  M, X, S, G             Jukes-Cantor with affine gap penalty    $O((n+m)^{3/2})$
  M, X_TS, X_TV, S, G    K2P with affine gap penalty             $O((n+m)^{12/5})$
L. Pachter and B. Sturmfels, Parametric inference for biological sequence analysis, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004).
L. Pachter and B. Sturmfels, Tropical geometry of statistical models, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004).
L. Pachter and B. Sturmfels (eds.), Algebraic Statistics for Computational Biology, Cambridge University Press.
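The exponents in the examples follow from the theorem once d is read as one less than the number of listed parameters (an interpretation inferred from the three rows, not stated on the slide). A small check, with the hypothetical helper `vertex_bound_exponent`:

```python
from fractions import Fraction


def vertex_bound_exponent(d):
    """Exponent d(d-1)/(d+1) in the O((n+m)^...) vertex bound."""
    return Fraction(d * (d - 1), d + 1)


# Assumption: d = (number of listed parameters) - 1.
examples = {
    "M,X,S": vertex_bound_exponent(2),             # 2/3
    "M,X,S,G": vertex_bound_exponent(3),           # 3/2
    "M,X_TS,X_TV,S,G": vertex_bound_exponent(4),   # 12/5
}
```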

The algebraic statistical model for sequence alignment, known as the pair hidden Markov model, is the image of a polynomial map of the model parameters. The logarithms of the parameters give the edge lengths for the shortest path problem on the alignment graph.

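Newton Polytope of a Polynomial
Definition: The Newton polytope of a polynomial $f(x_1,\dots,x_d) = \sum_{a} c_a\, x_1^{a_1} \cdots x_d^{a_d}$ is defined to be the convex hull of the lattice points in $\mathbb{R}^d$ corresponding to monomials in f: $NP(f) = \mathrm{conv}\{\, a \in \mathbb{Z}^d : c_a \neq 0 \,\}$.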
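As an illustration of the definition in two variables, the sketch below represents a polynomial as a dictionary from exponent pairs to coefficients (a representation chosen here for convenience) and returns the vertices of its Newton polytope. The function names are hypothetical.

```python
def cross(o, a, b):
    """Signed area of the turn o -> a -> b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Vertices of the convex hull of 2d points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def newton_polytope(poly):
    """Hull of the exponent vectors of monomials with nonzero coefficient."""
    return convex_hull([exp for exp, coeff in poly.items() if coeff != 0])


# f(x, y) = x^2 + x*y + y^2 + x + 1, plus a term with zero coefficient.
f = {(2, 0): 1, (1, 1): 1, (0, 2): 1, (1, 0): 1, (0, 0): 1, (5, 5): 0}
vertices = newton_polytope(f)
```

The exponents (1, 1) and (1, 0) lie on edges of the triangle, so they are not vertices, and the zero-coefficient term does not contribute.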

Polytope propagation
$NP_{i,j} = S * NP_{i-1,j} + S * NP_{i,j-1} + (X \text{ or } M) * NP_{i-1,j-1}$
$NP_{i,j}$: Newton polytope for positions [1,i] and [1,j] in each sequence.
"+": convex hull of union.
"*": Minkowski sum.
Example sequences on the alignment graph: AACATTAGA and AGATTACCACA.
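The recursion can be sketched for the 2d case. The code below is a minimal illustration, not the authors' software: it assumes each alignment is summarized by the pair (mismatches, spaces), with the number of matches then fixed by the sequence lengths, so multiplying by S, X or M becomes a translation by (0, 1), (1, 0) or (0, 0). The names `alignment_polytope` and `translate` are hypothetical.

```python
def cross(o, a, b):
    """Signed area of the turn o -> a -> b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Vertices of the convex hull of 2d points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def translate(polytope, step):
    """Minkowski sum of a polytope with a single point (a monomial)."""
    return [(v[0] + step[0], v[1] + step[1]) for v in polytope]


def alignment_polytope(a, b):
    """Vertices of the 2d alignment polytope in (mismatches, spaces)."""
    n, m = len(a), len(b)
    np_table = [[None] * (m + 1) for _ in range(n + 1)]
    np_table[0][0] = [(0, 0)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:   # S * NP[i-1][j]: a[i-1] aligned to a space
                candidates += translate(np_table[i - 1][j], (0, 1))
            if j > 0:   # S * NP[i][j-1]: b[j-1] aligned to a space
                candidates += translate(np_table[i][j - 1], (0, 1))
            if i > 0 and j > 0:   # (X or M) * NP[i-1][j-1]
                step = (0, 0) if a[i - 1] == b[j - 1] else (1, 0)
                candidates += translate(np_table[i - 1][j - 1], step)
            # "+" in the recursion: convex hull of the union.
            np_table[i][j] = convex_hull(candidates)
    return np_table[n][m]


polytope = alignment_polytope("AACATTAGA", "AGATTACCACA")
```

Keeping only hull vertices at every cell is what makes the computation feasible: interior summaries can never be optimal for any choice of parameters, so they are discarded early.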

Back to Adf1
BP England, U Heberlein, R Tjian. Purified Drosophila transcription factor, Adh distal factor-1 (Adf-1), binds to sites in several Drosophila promoters and activates transcription, J Biol Chem.

Drosophila DNase I Footprint Database (v2.0)
Columns: Target Gene, Chromosome Arm, Start, Stop, Transcription Factor, Pubmed ID (PMID), Footprint ID (FPID), Footprint Alignment
  Target Gene      Chromosome Arm   Transcription Factor   Footprint ID (FPID) / Footprint Alignment
  ems (CG2988)     3R               Abd-B (CG11648)        Abd-B->ems:
  ems (CG2988)     3R               Abd-B (CG11648)        Abd-B->ems:
  ems (CG2988)     3R               Abd-B (CG11648)        Abd-B->ems:
  ems (CG2988)     3R               Abd-B (CG11648)        Abd-B->ems:
  ems (CG2988)     3R               Abd-B (CG11648)        Abd-B->ems:
  ems (CG2988)     3R               Abd-B (CG11648)        Abd-B->ems:
  ems (CG2988)     3R               Abd-B (CG11648)        Abd-B->ems:
  dpp (CG9885)     2L               Adf1 (CG15845)         Adf1->dpp:
  Adh (CG3481)     2L               Adf1 (CG15845)         Adf1->Adh:
  Ddc (CG10697)    2L               Adf1 (CG15845)         Adf1->Ddc:
  Antp (CG1028)    3R               Adf1 (CG15845)         Adf1->Antp:
  Adh (CG3481)     2L               Adf1 (CG15845)         Adf1->Adh:
  Antp (CG1028)    3R               Adf1 (CG15845)         Adf1->Antp:
  Antp (CG1028)    3R               Adf1 (CG15845)         Adf1->Antp:006448

Back to Adf1
BLASTZ alignment:
mel TGTGCGTCAGCGTCGGCCGCAACAGCG
pse TGT                 GACTGCG
    ***                  ** ***
Two further alignments of the same region:
mel TGTG----CGTCAGC--G----TCGGCC---GC-AACAG-CG
pse TGTGACTGCG-CTGCCTGGTCCTCGGCCACAGCCAAC-GTCG
    ****    ** * **  *    ******   ** *** * **
mel TGTGCGTCAGC------GTCGGCCGCAACAGCG
pse TGTGACTGCGCTGCCTGGTCCTCGGCCACAGC-
    ****  *  **      ***  * ** *****
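The rows of asterisks mark columns where the two sequences agree. A minimal sketch of how such a line, and a percent identity, could be computed from a gapped alignment follows. The function names are hypothetical, and percent identity is taken here as matches divided by alignment columns, which is only one of several conventions and may differ from the one used in the talk.

```python
def conservation_line(row_a, row_b):
    """Return '*' under identical, non-gap columns and ' ' elsewhere."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    return "".join(
        "*" if a == b and a != "-" else " " for a, b in zip(row_a, row_b)
    )


def percent_identity(row_a, row_b):
    """Matches as a percentage of all alignment columns (assumed convention)."""
    stars = conservation_line(row_a, row_b)
    return 100.0 * stars.count("*") / len(row_a)


# The two gapped alignments of the Adf1 region shown above.
mel_1 = "TGTG----CGTCAGC--G----TCGGCC---GC-AACAG-CG"
pse_1 = "TGTGACTGCG-CTGCCTGGTCCTCGGCCACAGCCAAC-GTCG"
mel_2 = "TGTGCGTCAGC------GTCGGCCGCAACAGCG"
pse_2 = "TGTGACTGCGCTGCCTGGTCCTCGGCCACAGC-"

stars_1 = conservation_line(mel_1, pse_1)
stars_2 = conservation_line(mel_2, pse_2)
```

The same footprint therefore scores differently depending on which alignment is used, which is the motivation for looking at all optimal alignments rather than a single one.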


Per site analysis
  Group 1 mean per site % identity:           51.3%   47.8%
  Group 2 mean per site % identity:           47.8%   42.9%
  Difference of means (group 1 - group 2):    3.6%    8.4%    4.9%
  Difference of means resampling p-value:     E-5
  Distribution comparison KS p-value:         E-6
Per base analysis
  Group 1 mean per base % identity:           47.8%   46.3%
  Group 2 mean per base % identity:           46.3%   42.4%
  Difference of means (group 1 - group 2):    1.5%    5.4%    3.9%
Values added on successive builds of this slide: 80.4%, 85.1%, 86.5%, 79.1%.
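The table reports a resampling p-value for the difference of means. The slide does not say which resampling scheme was used, so the following is only a generic two-sided permutation test under that assumption. The function name, the number of resamples and the seed are all hypothetical choices.

```python
import random


def mean(values):
    return sum(values) / len(values)


def resampling_p_value(group1, group2, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference of means.

    Pools both groups, repeatedly reshuffles the group labels, and counts
    how often the shuffled difference is at least as large as the observed
    one. The +1 terms keep the estimate strictly positive.
    """
    rng = random.Random(seed)
    observed = abs(mean(group1) - mean(group2))
    pooled = list(group1) + list(group2)
    k = len(group1)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:k]) - mean(pooled[k:]))
        if diff >= observed:
            extreme += 1
    return (extreme + 1) / (n_resamples + 1)
```

In the setting of the table, each group would hold per-site (or per-base) percent identities for one set of footprints.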

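Applications
Conservation of cis-regulatory elements
Phylogenetics: branch length estimation
  Jukes-Cantor correction: $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where $p$ is the fraction of aligned sites that are mismatches.
  This is the expected number of mutations per site in an alignment with summary (x,s).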
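A branch length can be computed from an alignment summary with the standard Jukes-Cantor correction. The sketch below assumes that x counts mismatches, that s counts individual gap characters, and that the combined length n + m of the two sequences is known, so the number of columns pairing two bases is (n + m - s) / 2. The function name and argument names are hypothetical.

```python
import math


def jukes_cantor_distance(x, s, total_length):
    """Expected substitutions per site for an alignment summary (x, s).

    x: number of mismatch columns.
    s: number of spaces (individual gap characters), an assumed convention.
    total_length: n + m, the combined length of both sequences.
    """
    aligned_columns = (total_length - s) / 2   # columns pairing two bases
    if aligned_columns <= 0:
        raise ValueError("no aligned columns in this summary")
    p = x / aligned_columns                    # observed mismatch fraction
    if p >= 0.75:
        return math.inf                        # correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Evaluating this function at every vertex of an alignment polytope shows how strongly the estimated branch length depends on the choice of alignment parameters.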
