An EM Algorithm for Inferring the Evolution of Eukaryotic Gene Structure Liran Carmel, Igor B. Rogozin, Yuri I. Wolf and Eugene V. Koonin NCBI, NLM, National.

Presentation transcript:

An EM Algorithm for Inferring the Evolution of Eukaryotic Gene Structure Liran Carmel, Igor B. Rogozin, Yuri I. Wolf and Eugene V. Koonin NCBI, NLM, National Institutes of Health

Outline: Background and Related Work; Data Components; The Model; The Algorithm; Results – Homogeneous Evolution; Results – Heterogeneous Evolution; Summary

What are Exons and Introns? [Figure: exon1, an intron bounded by GU and AG, and exon2; splicing removes the intron, leaving exon1 joined to exon2 in the mRNA.]

Related work: Gilbert 2005 [hybrid; branch-specific]; Koonin 2003 [Dollo parsimony]; Csuros 2005 [ML; branch-specific]; Kenmochi 2005 [ML; branch-specific]; Stoltzfus 2004 [Bayes; gene-specific]. [Figure: the studies placed along an axis running from gain to loss; labels in order: gain; Stoltzfus; Koonin, Kenmochi, Csuros; Gilbert; Koonin, Kenmochi, Csuros; loss.]

Outline: Background and Related Work; Data Components; The Model; The Algorithm; Results – Homogeneous Evolution; Results – Heterogeneous Evolution; Summary

Phylogenetic tree

Multiple alignment
HS …ATGTCGATCGTGCTCGTCGTACTCTCGTAC…
DM …ATGTGGATCGTGCTCGTCGTACTCTCGTAC…
CE …ATGTGGATTGTGCTCGTCGTACTCTCGTAC…
AT …ATGTTGATGGTGCTCGTCGTACTCTCGTAC…
SC …ATGTTGATTGTGCTCGTCGTACTCTCGTAC…
SP …ATGTTGATT---CTCGTCGTACTCTCGTAC…

Presence/absence maps (proteasome component C3): strong phyletic signal. [Figure: intron presence/absence map across SC, SP, CE, DM, HS, AT.]

Missing data
HS …ATGTCGATCGTGCTCGTCGTACTCTCGTAC…
DM …ATGTGGATCGTGCTCGTCGTACTCTCGTAC…
CE …ATGTGGATTGTGCTCGTCGTACTCTCGTAC…
AT …ATGTTGATGGTGCTCGTCGTACTCTCGTAC…
SC …ATGTTGATTGTGCTCGTCGTACTCTCGTAC…
SP …ATGTTGATT---CTCGTCGTACTCTCGTAC…
? [missing-data marker; its position in the alignment is not preserved]

Missing data (proteasome component C3). [Figure: presence/absence map across SC, SP, CE, DM, HS, AT, with "?" marking two missing entries.]

Bayesian Network

Outline: Background and Related Work; Data Components; The Model; The Algorithm; Results – Homogeneous Evolution; Results – Heterogeneous Evolution; Summary

Probability structure. Root prior probability: [formula not preserved]. Transition probability for a gene and a branch of a given length: a 2×2 table indexed by parent state (0 or 1) and descendant state (0 or 1) [entries not preserved]. Labeled quantities: branch-specific loss, gene-specific loss, branch-specific gain, gene-specific gain.
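The formulas on this slide are not available in the transcript, so the following is only a sketch of one parameterization that is consistent with the labels (a gene-specific rate and a branch-specific coefficient for each of gain and loss, acting over a branch length). The functional form, the symbol names `xi`, `phi`, `eta`, `theta`, `delta`, and the reading of state 0 as "intron absent" and state 1 as "intron present" are all assumptions made for illustration, not the authors' definitions.

```python
import math

def transition_matrix(xi, phi, eta, theta, delta):
    """Return T[parent][descendant] over states 0 (absent) and 1 (present).

    Assumed (hypothetical) form:
      gain, P(0 -> 1) = xi * (1 - exp(-eta * delta))
      loss, P(1 -> 0) = 1 - (1 - phi) * exp(-theta * delta)
    xi, phi   : branch-specific gain / loss coefficients in [0, 1]
    eta, theta: gene-specific gain / loss rates (non-negative)
    delta     : branch length (non-negative)
    """
    gain = xi * (1.0 - math.exp(-eta * delta))
    keep = (1.0 - phi) * math.exp(-theta * delta)
    return [[1.0 - gain, gain],
            [1.0 - keep, keep]]

# Illustrative values only; they are not estimates from the talk.
T = transition_matrix(xi=0.3, phi=0.1, eta=0.5, theta=0.2, delta=1.0)
```

Under this assumed form each row is a probability distribution over the descendant state, and a zero-length branch with `phi = 0` leaves the state unchanged.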

Rate variation across sites: gain variation and loss variation [formulas not preserved]. Labeled quantities: shape parameter (gain), fraction of invariant sites, shape parameter (loss).

Parameter Summary. Global parameters: probability for intron absence in the root; fraction of invariant sites; shape parameters of the gamma distribution. Gene-specific parameters: gain rate; loss rate. Branch-specific parameters: gain coefficient; loss coefficient.

Homogeneous vs. Heterogeneous Evolution. The number of parameters in the model depends on the number of extant species and the number of genes G [formula not preserved]. Homogeneous evolution: setting G = 1. Heterogeneous evolution: fixing global parameters and branch-specific parameters.
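The slide's formula for the parameter count is not in the transcript. As a rough sketch, the parameters listed in the Parameter Summary can be tallied directly: four global parameters (root prior, fraction of invariant sites, and one gamma shape parameter each for gain and loss), two per gene, and two per branch. The function name is hypothetical, and the tally assumes a rooted bifurcating tree with 2S - 2 branches for S extant species and treats every parameter as free; the authors' own count may differ, for example if identifiability constraints remove some parameters.

```python
def count_parameters(num_species, num_genes):
    """Tally model parameters under the assumptions stated above."""
    global_params = 4                     # root prior, invariant fraction, two gamma shapes
    gene_params = 2 * num_genes           # gain rate and loss rate per gene
    num_branches = 2 * num_species - 2    # assumption: rooted bifurcating tree
    branch_params = 2 * num_branches      # gain and loss coefficient per branch
    return global_params + gene_params + branch_params

# Homogeneous setting (G = 1) on a six-species tree such as the one in the example maps.
homogeneous_total = count_parameters(num_species=6, num_genes=1)
```

In the heterogeneous setting the global and branch-specific parameters are held fixed, so only the `2 * num_genes` gene-specific parameters remain to be estimated.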

Outline: Background and Related Work; Data Components; The Model; The Algorithm; Results – Homogeneous Evolution; Results – Heterogeneous Evolution; Summary

Likelihood maximization via Expectation Maximization. E-step: inward-outward recursions on the tree; a member of the junction-tree family of algorithms; missing data are naturally embedded.

Inward (gamma) recursion. [Figure: tree with six nodes marked "?" and a node labeled q.]

Inward (gamma) recursion - Initialization

Inward (gamma) recursion - Recursion [formula for node q not preserved]

Outward (alpha) recursion
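The recursion formulas themselves are not in the transcript. The sketch below shows a standard inward-outward pass on a rooted tree with binary states, using the slides' names gamma (inward) and alpha (outward). The definitions used here are assumptions: gamma[q][s] is the probability of the observations below node q given that q is in state s, and alpha[q][s] is the joint probability of state s at q and all observations outside q's subtree. The function name and data layout are hypothetical. A missing leaf is handled by setting its gamma to 1 for both states.

```python
def inward_outward(children, root, prior, trans, obs):
    """children: node -> list of child nodes (leaves map to []).
    prior: [P(root=0), P(root=1)]; trans[c][s][t] = P(c=t | parent of c = s).
    obs: leaf -> 0, 1, or None (missing)."""
    gamma, alpha = {}, {}

    def msg(c, s):
        # Probability of everything observed at or below c, given parent state s.
        return sum(trans[c][s][t] * gamma[c][t] for t in (0, 1))

    def inward(q):
        if not children[q]:
            o = obs.get(q)
            # Missing observation: both states are fully compatible.
            gamma[q] = [1.0, 1.0] if o is None else [float(o == 0), float(o == 1)]
            return
        for c in children[q]:
            inward(c)
        gamma[q] = [1.0, 1.0]
        for s in (0, 1):
            for c in children[q]:
                gamma[q][s] *= msg(c, s)

    def outward(q):
        for c in children[q]:
            alpha[c] = [0.0, 0.0]
            for s in (0, 1):
                rest = alpha[q][s]
                for b in children[q]:
                    if b != c:            # fold in the siblings' subtrees
                        rest *= msg(b, s)
                for t in (0, 1):
                    alpha[c][t] += rest * trans[c][s][t]
            outward(c)

    inward(root)
    alpha[root] = list(prior)
    outward(root)
    likelihood = sum(alpha[root][s] * gamma[root][s] for s in (0, 1))
    posterior = {q: [alpha[q][s] * gamma[q][s] / likelihood for s in (0, 1)]
                 for q in gamma}
    return likelihood, posterior

# Toy tree: a root with leaves A (intron present) and B (missing).
T = [[0.9, 0.1], [0.2, 0.8]]
likelihood, posterior = inward_outward(
    children={"root": ["A", "B"], "A": [], "B": []},
    root="root", prior=[0.5, 0.5],
    trans={"A": T, "B": T}, obs={"A": 1, "B": None})
```

The product alpha[q][s] * gamma[q][s], summed over s, gives the same likelihood at every node, and normalizing it gives the posterior state probabilities that an E-step needs.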

Likelihood maximization via EM. E-step: inward-outward recursions on the tree; a member of the junction-tree family of algorithms; missing data are naturally embedded. M-step: low-tolerance variable-by-variable maximization; Newton-Raphson.
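A variable-by-variable Newton-Raphson update can be sketched as follows. This is not the authors' implementation. "Low-tolerance" is read here as running only a few Newton steps per variable rather than maximizing each one exactly, and finite-difference derivatives stand in for whatever derivatives the authors used. The function name and the toy objective are hypothetical.

```python
def newton_coordinate_ascent(q, x0, sweeps=50, newton_steps=2, h=1e-4):
    """Maximize q(x) one variable at a time, a few Newton-Raphson steps each."""
    x = list(x0)
    for _ in range(sweeps):
        for i in range(len(x)):
            def f(v):
                # q as a function of variable i alone, the others held fixed.
                y = list(x)
                y[i] = v
                return q(y)
            for _ in range(newton_steps):    # inexact inner maximization
                d1 = (f(x[i] + h) - f(x[i] - h)) / (2 * h)
                d2 = (f(x[i] + h) - 2 * f(x[i]) + f(x[i] - h)) / (h * h)
                if d2 >= 0:                  # not locally concave: skip this variable
                    break
                x[i] -= d1 / d2              # Newton-Raphson step
    return x

# Toy concave objective with its maximum at (0, 2).
def toy_objective(x):
    return -(x[0] - 1) ** 2 - (x[1] - 2) ** 2 - x[0] * x[1]

x_hat = newton_coordinate_ascent(toy_objective, [5.0, -3.0])
```

In an EM setting, `q` would be the expected complete-data log-likelihood computed from the E-step posteriors, and each sweep would update one model parameter at a time.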

Outline: Background and Related Work; Data Components; The Model; The Algorithm; Results – Homogeneous Evolution; Results – Heterogeneous Evolution; Summary

Intron density in ancient eukaryotes. [Figure labels: Gilbert 2005; Koonin 2003; Csuros 2005; Kenmochi 2005; Stoltzfus 2004.]

Evolutionary Landscape. [Figure labels: loser, gainer, stable, dynamic.]

Modes of Evolution

[Figure labels: loser, gainer, stable, dynamic.]

Outline: Background and Related Work; Data Components; The Model; The Algorithm; Results – Homogeneous Evolution; Results – Heterogeneous Evolution; Summary. [Labels on the slide: 234 genes, 295 genes, 187 genes.]

Gene Characteristics. New features of genes: intron gain rate; intron loss rate. Old features of genes: expression level; evolutionary rate; lethality; connectivity in protein-protein interactions; connectivity in genetic interactions.

Combined Features

Important genes gain introns. [Figure labels: Status, Adaptability, Reactivity; Gain rate, Loss rate.]

Outline: Background and Related Work; Data Components; The Model; The Algorithm; Results – Homogeneous Evolution; Results – Heterogeneous Evolution; Summary

Conclusions. Disparate landscape: both gain and loss play a role in intron evolution. The common ancestor of the crown group had an intron content comparable to that of fungi, apicomplexans and dipterans. Three modes of evolution: more than one mechanism? Important genes tend to gain introns.