Inferring gene regulatory networks from transcriptomic profiles Dirk Husmeier Biomathematics & Statistics Scotland.

Overview: Introduction; Methodology; Circadian regulation in Arabidopsis; Application to synthetic biology; DREAM

Network reconstruction from postgenomic data

Accuracy versus computational complexity: methods based on correlation and mutual information; conditional independence graphs; mechanistic models; Bayesian networks

Shortcomings: pairwise associations do not take the context of the system into consideration. Cases illustrated: direct interaction, indirect interaction, co-regulation by a common regulator

Accuracy versus computational complexity: methods based on correlation and mutual information; conditional independence graphs; mechanistic models; Bayesian networks

Conditional independence graphs (CIGs). Direct interaction: partial correlation, i.e. correlation conditional on all other domain variables, Corr(X1, X2 | X3, …, Xn). A strong partial correlation π12 is obtained from the inverse of the covariance matrix

[Figure: correlation versus partial correlation. Edge labels: high, high, high, high, low]

Conditional independence graphs (CIGs). Direct interaction: partial correlation, i.e. correlation conditional on all other domain variables, Corr(X1, X2 | X3, …, Xn). A strong partial correlation π12 is obtained from the inverse of the covariance matrix. Problem: #observations < #variables ⇒ covariance matrix is singular
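As a minimal sketch of the idea on this slide, the partial correlations can be read off the inverse covariance (precision) matrix. The function name, the simulated chain X1 → X2 → X3 and all sample sizes are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlation matrix from the inverse covariance matrix.

    data: array of shape (n_observations, n_variables).
    """
    cov = np.cov(data, rowvar=False)
    precision = np.linalg.inv(cov)  # breaks down when cov is singular
    d = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(d, d)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

# Hypothetical chain X1 -> X2 -> X3: X1 and X3 are linked only through X2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=5000)
x2 = x1 + 0.5 * rng.normal(size=5000)
x3 = x2 + 0.5 * rng.normal(size=5000)
data = np.column_stack([x1, x2, x3])

corr = np.corrcoef(data, rowvar=False)   # pairwise: X1-X3 looks strongly linked
pcorr = partial_correlations(data)       # conditional: X1-X3 link vanishes

# With fewer observations than variables the covariance matrix is rank deficient.
few = rng.normal(size=(5, 10))
rank = np.linalg.matrix_rank(np.cov(few, rowvar=False))
```

In the simulated chain the plain correlation between X1 and X3 is high, while their partial correlation is close to zero, which is the distinction between indirect and direct interactions that the slide draws.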

Accuracy versus computational complexity: methods based on correlation and mutual information; conditional independence graphs; mechanistic models; Bayesian networks

Regulatory network

Description with differential equations: rates, concentrations, kinetic parameters θ

Model, parameters θ. Probability theory → likelihood

1) Practical problem: numerical optimization over θ. 2) Conceptual problem: overfitting. The maximum likelihood increases with increasing network complexity

Overfitting problem: compared with the true pathway, a simpler pathway gives a poorer fit to the data, whereas a more complex pathway gives an equal or better fit to the data

Regularization, e.g. the Bayesian information criterion (BIC): a data misfit term, evaluated at the maximum likelihood parameters, plus a regularization term that depends on the number of parameters and the number of data points
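The equation on this slide was not captured in the transcript. One common way of writing the BIC that matches the labels on the slide is sketched below. The sign and scaling convention, and the symbols D (data), \hat{\theta}_{ML} (maximum likelihood parameters), k (number of parameters) and N (number of data points), are assumptions.

```latex
\mathrm{BIC}
  = \underbrace{-2 \log p\bigl(D \mid \hat{\theta}_{\mathrm{ML}}\bigr)}_{\text{data misfit term}}
  + \underbrace{k \log N}_{\text{regularization term}}
```

Under this convention a lower BIC is better: adding parameters can only reduce the misfit term, but it increases the penalty.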

[Figure: likelihood and BIC as functions of model complexity]

Model selection: find the best pathway. Select the model with the highest posterior probability. This requires an integration over the whole parameter space
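The two equations on this slide were not captured in the transcript. In standard Bayesian notation, with M for the model (pathway), D for the data and θ for the parameters (symbols assumed here), the statement reads:

```latex
P(M \mid D) \propto P(D \mid M)\, P(M),
\qquad
P(D \mid M) = \int P(D \mid \theta, M)\, P(\theta \mid M)\, \mathrm{d}\theta
```

The integral on the right is the marginal likelihood; it is this term that makes model selection expensive for mechanistic models.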

MCMC-based schemes (sampling θ). Problem: excessive computational costs

Accuracy versus computational complexity: methods based on correlation and mutual information; conditional independence graphs; mechanistic models; Bayesian networks

Marriage between graph theory and probability theory. Friedman et al. (2000), J. Comp. Biol. 7

Bayes net ODE model

Model, parameters θ. Bayesian networks: integral analytically tractable!

UAI 1994

Example: 2 genes → 16 different network structures. Compute the posterior probability of each structure
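A short sketch of where the number 16 can come from: with 2 genes there are 2 × 2 = 4 possible directed edges if self-loops are allowed (as in a dynamic Bayesian network), and each edge is either present or absent. That reading of the slide is an assumption, and the function name is hypothetical.

```python
from itertools import product

def all_structures(n_genes):
    """Yield every adjacency matrix over n_genes nodes, self-loops allowed."""
    n_edges = n_genes * n_genes
    for bits in product((0, 1), repeat=n_edges):
        # Row i lists the outgoing edges of gene i.
        yield [list(bits[i * n_genes:(i + 1) * n_genes]) for i in range(n_genes)]

structures = list(all_structures(2))
n_structures = len(structures)  # 2 ** (2 * 2)
```

For small networks like this one, every structure can be scored exhaustively; the count grows as 2^(n²), so enumeration stops being feasible after a handful of genes.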

Identify the best network structure. Ideal scenario: large data sets, low noise

Uncertainty about the best network structure. Limited number of experimental replications, high noise

Sample of high-scoring networks

Feature extraction, e.g. marginal posterior probabilities of the edges

Sample of high-scoring networks. Feature extraction, e.g. marginal posterior probabilities of the edges: high-confidence edge, high-confidence non-edge, uncertainty about edges
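The feature extraction step can be sketched as an average over the sampled adjacency matrices. The four sampled networks and the 0.9 / 0.1 cut-offs below are made up for illustration; they are not values from the talk.

```python
import numpy as np

def edge_posteriors(sampled_networks):
    """Fraction of sampled networks that contain each edge."""
    return np.mean(np.asarray(sampled_networks, dtype=float), axis=0)

def classify_edges(posteriors, high=0.9, low=0.1):
    """Label edges; the cut-offs are hypothetical."""
    return np.where(posteriors >= high, "edge",
                    np.where(posteriors <= low, "non-edge", "uncertain"))

# Hypothetical sample of four 3-node networks (entry [i][j] = edge i -> j).
sample = [
    [[0, 1, 0], [0, 0, 1], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 0, 1], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 0], [0, 0, 0]],
]
posteriors = edge_posteriors(sample)
labels = classify_edges(posteriors)
```

Here the edge 0 → 1 appears in every sampled network, the edge 1 → 2 in half of them, and the edge 2 → 0 in none, giving one example of each of the three cases named on the slide.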

[Figure: number of structures versus number of nodes.] Sampling with MCMC
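To give a feel for the growth shown in the figure, the sketch below counts labelled directed acyclic graphs with Robinson's recurrence. The slide does not say which count is plotted, so using the DAG count (as for static Bayesian networks) is an assumption.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

counts = [count_dags(n) for n in range(1, 6)]
```

The count grows super-exponentially with the number of nodes, which is why the structures are sampled with MCMC rather than enumerated.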

Linearity assumption: [A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise, where node A has parents P1, P2, P3, P4 with weights w1, w2, w3, w4
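A minimal sketch of the linearity assumption: simulate a target A as a weighted sum of four parents plus noise and recover the weights by ordinary least squares. The weights, noise level and sample size are arbitrary illustrative choices, and least squares stands in for the Bayesian treatment used in the talk.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical weights w1..w4; a zero weight means "no real influence".
true_w = np.array([0.8, -0.5, 0.3, 0.0])

# Columns are the parent concentrations [P1]..[P4], rows are observations.
parents = rng.normal(size=(500, 4))
target = parents @ true_w + 0.1 * rng.normal(size=500)  # [A]

est_w, residuals, rank, singular_values = np.linalg.lstsq(parents, target, rcond=None)
```

With enough data the estimated weights land close to the simulated ones; a purely linear model of this kind cannot, however, capture saturation or switch-like regulation.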

Homogeneity assumption Parameters don’t change with time

Limitations of the homogeneity assumption

Overview: Introduction; Methodology; Circadian regulation in Arabidopsis; Application to synthetic biology; DREAM

Accuracy versus computational complexity: methods based on correlation and mutual information; conditional independence graphs; mechanistic models; Bayesian networks

Example: 4 genes, 10 time points. Rows are genes X(1) to X(4), columns are time points t1 to t10.
       t1    t2    t3    t4    t5    t6    t7    t8    t9    t10
X(1)   X1,1  X1,2  X1,3  X1,4  X1,5  X1,6  X1,7  X1,8  X1,9  X1,10
X(2)   X2,1  X2,2  X2,3  X2,4  X2,5  X2,6  X2,7  X2,8  X2,9  X2,10
X(3)   X3,1  X3,2  X3,3  X3,4  X3,5  X3,6  X3,7  X3,8  X3,9  X3,10
X(4)   X4,1  X4,2  X4,3  X4,4  X4,5  X4,6  X4,7  X4,8  X4,9  X4,10

Standard dynamic Bayesian network: homogeneous model (one set of parameters for all 10 time points of the 4 × 10 data matrix)

Limitations of the homogeneity assumption

Our new model: heterogeneous dynamic Bayesian network. Here: 2 components (the time points of the 4 × 10 data matrix are allocated to 2 components)

Our new model: heterogeneous dynamic Bayesian network. Here: 3 components (the time points of the 4 × 10 data matrix are allocated to 3 components)

Extension of the model: parameters θ

Parameters θ; k: number of components (here: 3); h: allocation vector

Analytically integrate out the parameters θ. k: number of components (here: 3); h: allocation vector

Non-homogeneous model → non-linear model

BGe: linear model. [A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise, where node A has parents P1, P2, P3, P4 with weights w1, w2, w3, w4

Can we get an approximate nonlinear model without data discretization? [Figure: y plotted against x]

Idea: piecewise linear model [Figure: y plotted against x]
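The piecewise linear idea can be sketched with two straight-line fits on either side of a breakpoint. The toy data (y = |x|), the function names and the hand-picked breakpoint are assumptions for illustration; in the model described in the talk the segmentation is inferred from the data rather than fixed in advance.

```python
import numpy as np

def fit_segments(x, y, breakpoint):
    """Least-squares line (slope, intercept) on each side of the breakpoint."""
    left = x < breakpoint
    return [tuple(np.polyfit(x[m], y[m], deg=1)) for m in (left, ~left)]

def predict(x, segments, breakpoint):
    """Evaluate the two-segment model at the points x."""
    (s0, i0), (s1, i1) = segments
    return np.where(x < breakpoint, s0 * x + i0, s1 * x + i1)

# A nonlinear relationship that no single straight line can describe.
x = np.linspace(-2.0, 2.0, 41)
y = np.abs(x)

segments = fit_segments(x, y, breakpoint=0.0)
piecewise_mse = np.mean((predict(x, segments, 0.0) - y) ** 2)

slope, intercept = np.polyfit(x, y, deg=1)
single_mse = np.mean((slope * x + intercept - y) ** 2)
```

Two linear pieces reproduce the toy relationship essentially exactly, while a single line leaves a large residual, and no discretization of the data is needed.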

Inhomogeneous dynamic Bayesian network with common changepoints (all genes in the 4 × 10 data matrix share the same changepoints)

Inhomogeneous dynamic Bayesian network with node-specific changepoints (each gene in the 4 × 10 data matrix has its own changepoints)

Overview: Introduction; Methodology; Circadian regulation in Arabidopsis; Application to synthetic biology; DREAM

Circadian regulation in Arabidopsis thaliana

Circadian rhythms in Arabidopsis thaliana. Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University (Andrew Millar’s group)
- Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3
- Four time series measured under constant light condition at 13 time points: 0h, 2h,…, 24h, 26h
- Seedlings entrained with different light:dark cycles between 10h:10h (T20) and 14h:14h (T28)

Posterior probability of changepoints

Sample of high-scoring networks

Marginal posterior probabilities of the edges (legend: P=1, P=0, P=0.5). Predict an interaction if the marginal posterior probability > 0.5

Plant clockwork from the literature. Review: Rob McClung, Plant Cell 2006. Two major gene classes: morning genes, e.g. LHY, CCA1, repress evening genes, e.g. TOC1, ELF3, ELF4, GI, LUX, which activate LHY and CCA1

Which interactions from the literature are found? [Network figure over CCA1, LHY, PRR9, GI, ELF3, TOC1, ELF4, PRR5, PRR3, marking true positives and false negatives. Blue: activations, red: inhibitions.] True positives (TP) = 8, false negatives (FN) = 5, recall = 8/13 = 62%

What proportion of predicted interactions is confirmed by the literature? [Network figure marking true positives and false positives. Blue: activations, red: inhibitions.] True positives (TP) = 8, false positives (FP) = 13, precision = 8/21 = 38%

Summary for the network over CCA1, LHY, PRR9, GI, ELF3, TOC1, ELF4, PRR5, PRR3: precision = 38%, recall = 62%

True positives (TP) = 8, false positives (FP) = 13, false negatives (FN) = 5, true negatives (TN) = 9² − (TP + FP + FN) = 81 − 26 = 55. Sensitivity (recall) = TP/[TP+FN] = 62%. Specificity (proportion of avoided non-interactions) = TN/[TN+FP] = 81%
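The arithmetic on these slides can be checked with a few lines. The sketch assumes, as the 9² on the slide suggests, that all 9 × 9 ordered gene pairs count as candidate interactions; the function name is hypothetical.

```python
def confusion_metrics(tp, fp, fn, n_nodes):
    """Recall, precision and specificity from edge counts on an n_nodes network."""
    tn = n_nodes ** 2 - (tp + fp + fn)  # all remaining candidate edges
    return {
        "tn": tn,
        "recall": tp / (tp + fn),        # sensitivity
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
    }

metrics = confusion_metrics(tp=8, fp=13, fn=5, n_nodes=9)
percentages = {k: round(100 * v) for k, v in metrics.items() if k != "tn"}
```

Because true negatives dominate in a sparse network, specificity looks far better than precision for the same set of predictions.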

Core plant clock model with components X, LHY/CCA1, TOC1, Y (GI) and PRR9/PRR7, split into morning and evening genes. Locke et al., Mol. Syst. Biol. 2006

Non-stationarity in the regulatory process

Non-stationarity in the network structure

Flexible network structure.

Flexible network structure with regularization

ICML 2010

Morphogenesis in Drosophila melanogaster Gene expression measurements over 66 time steps of 4028 genes (Arbeitman et al., Science, 2002). Selection of 11 genes involved in muscle development. Zhao et al. (2006), Bioinformatics 22

Transition probabilities: flexible structure with regularization. Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult

Overview: Introduction; Methodology; Circadian regulation in Arabidopsis; Application to synthetic biology; DREAM

Can we learn the switch galactose → glucose? Can we learn the network structure?

NIPS 2010

[Figure: hierarchical Bayesian model over nodes 1, …, i, …, p; segment H]

Exponential versus binomial prior distribution. Exploration of various information sharing options

Task 1: Changepoint detection. Switch of the carbon source: galactose → glucose

Task 2: Network reconstruction. Precision: proportion of identified interactions that are correct. Recall: proportion of true interactions that we successfully recovered

BANJO: conventional homogeneous DBN. TSNI: method based on differential equations. Inference: optimization, “best” network

Sample of high-scoring networks

Marginal posterior probabilities of the edges (legend: P=1, P=0, P=0.5)

Keep interactions with a posterior probability > 0.5. Better evaluation: consider all possible thresholds → precision-recall curves

Precision-recall evaluation against the true network (edge legend: P=1, P=0, P=0.5). Precision = TP/(TP+FP), Recall = TP/(TP+FN). Only the first threshold value survives in the transcript.
Thresh   0.9   (not shown)   (not shown)
TP       1     2             2
FP       0     1             2
FN       1     0             0
Prec     1     2/3           1/2
Recall   1/2   1             1
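The threshold sweep can be sketched as follows. The edge names, the posterior values (1, 0.5, 0.5, 0) and the thresholds 0.5 and 0.0 are hypothetical; they were chosen only so that the counts match the table on the slide, where just the threshold 0.9 is given.

```python
def precision_recall(posteriors, true_edges, threshold):
    """Counts and scores when predicting every edge with posterior >= threshold."""
    predicted = {edge for edge, p in posteriors.items() if p >= threshold}
    tp = len(predicted & true_edges)
    fp = len(predicted - true_edges)
    fn = len(true_edges - predicted)
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention for no predictions
    recall = tp / (tp + fn)
    return tp, fp, fn, precision, recall

# Hypothetical marginal posterior probabilities of four candidate edges.
posteriors = {("A", "B"): 1.0, ("B", "C"): 0.5, ("A", "C"): 0.5, ("C", "A"): 0.0}
true_edges = {("A", "B"), ("B", "C")}

rows = [precision_recall(posteriors, true_edges, t) for t in (0.9, 0.5, 0.0)]
```

Lowering the threshold trades precision for recall; plotting the (recall, precision) pairs over all thresholds gives the precision-recall curve, and the area under it summarizes performance without committing to one cut-off.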

Galactose

Glucose

Average performance over both phases (galactose and glucose):
Prior         Coupling   Average AUC
None          -          0.70
Exponential   Hard       0.77
Binomial      Hard       0.75
Binomial      Soft       0.75

How are we getting from here …

… to there ?!

Overview: Introduction; Methodology; Circadian regulation in Arabidopsis; Application to synthetic biology; DREAM

DREAM: Dialogue for Reverse Engineering Assessments and Methods. International network reconstruction competition: June-Sept 2010. [Table of four networks, the first being Network 1 (in silico), with columns # transcription factors, # genes, # chips; the values are not in the transcript]

Our team: Marco Grzegorczyk (University of Dortmund, Germany); Frank Dondelinger (BioSS / University of Edinburgh, United Kingdom); Sophie Lèbre (Université de Strasbourg, France); Andrej Aderhold (BioSS / University of St Andrews, United Kingdom)

Our model: developed for time series. Data: different experimental conditions, perturbations (e.g. ligand injection), interventions (e.g. gene knock-out, overexpression), time points

Change-point process versus free allocation

Our model: developed for time series. Data: different experimental conditions, perturbations (e.g. ligand injection), interventions (e.g. gene knock-out, overexpression), time points. To limit computational complexity: stick to a changepoint process. How do we get an ordering of the genes?

PCA

SOM

No time series → use 1-dim SOM to get a chip order

Ordering of chips → changepoint model
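As a rough illustration of getting a chip order without time stamps, the sketch below sorts chips by their score on the first principal component. This is a simpler stand-in for the 1-dimensional SOM named on the slide, not the method used by the team; the simulated data, loadings and function name are assumptions.

```python
import numpy as np

def order_chips_by_first_pc(expression):
    """Chip indices sorted by first-principal-component score.

    expression: array of shape (n_chips, n_genes).
    """
    centred = expression - expression.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[0]  # projection onto the leading direction
    return np.argsort(scores)

# Hypothetical chips driven by one hidden, shuffled progression variable.
rng = np.random.default_rng(1)
latent = rng.permutation(np.linspace(0.0, 1.0, 12))
loadings = np.array([1.0, -2.0, 0.5])
expression = np.outer(latent, loadings) + 0.01 * rng.normal(size=(12, 3))

order = order_chips_by_first_pc(expression)
recovered = latent[order]  # monotone, up to an arbitrary direction
```

Once the chips are arranged along one dimension, a changepoint process can be applied to that ordering as if it were a time axis. The direction of the ordering is arbitrary, which does not matter for placing changepoints.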

Slow MCMC convergence [shown with the table of DREAM networks: # transcription factors, # genes, # chips]

Problems with MCMC convergence [shown with the table of DREAM networks; reference on slide: PNAS 2009]

[Figure: area under the precision-recall curve for the methods entered in the competition]

Room for improvement: higher-dimensional changepoint process (dimensions: perturbations, experimental conditions)

Acknowledgements: Marco Grzegorczyk (University of Dortmund, Germany); Frank Dondelinger (BioSS / University of Edinburgh, United Kingdom); Sophie Lèbre (Université de Strasbourg, France); Andrej Aderhold (BioSS / University of St Andrews, United Kingdom)