Inferring gene regulatory networks from transcriptomic profiles Dirk Husmeier Biomathematics & Statistics Scotland
Overview: Introduction; Application to synthetic biology; Lessons from DREAM
Network reconstruction from postgenomic data
Accuracy vs. computational complexity: methods based on correlation and mutual information; conditional independence graphs; mechanistic models; Bayesian networks
Shortcomings: pairwise associations cannot distinguish a direct interaction from a common regulator, an indirect interaction, or co-regulation; they do not take the context of the system into consideration.
Accuracy vs. computational complexity: methods based on correlation and mutual information; conditional independence graphs; mechanistic models; Bayesian networks
Conditional independence graphs (CIGs): a direct interaction corresponds to a strong partial correlation π12, i.e. correlation conditional on all other domain variables, Corr(X1, X2 | X3, …, Xn), obtained from the inverse of the covariance matrix. Problem: when #observations < #variables, the covariance matrix is singular.
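A minimal sketch in pure Python (with simulated data rather than real expression profiles) of how the first-order partial correlation removes the spurious association induced by a common regulator:

```python
import math
import random

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x1, x2, x3):
    """First-order partial correlation Corr(X1, X2 | X3)."""
    r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

random.seed(0)
# X3 is a common regulator of X1 and X2; there is no direct X1-X2 edge.
x3 = [random.gauss(0, 1) for _ in range(500)]
x1 = [v + random.gauss(0, 0.3) for v in x3]
x2 = [v + random.gauss(0, 0.3) for v in x3]

print(round(corr(x1, x2), 2))              # strong marginal correlation
print(round(partial_corr(x1, x2, x3), 2))  # close to zero after conditioning
```

In the full CIG setting the same quantity is read off the inverse covariance matrix; the two-variable formula above is the special case with a single conditioning variable.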
Accuracy vs. computational complexity: methods based on correlation and mutual information; conditional independence graphs; mechanistic models; Bayesian networks
Mechanistic model with parameters θ; probability theory yields the likelihood of the data given θ.
1) Practical problem: numerical optimization over θ. 2) Conceptual problem: overfitting — the maximum-likelihood estimate increases as the network complexity increases.
Overfitting problem: the true (sparser) pathway gives a poorer fit to the data, while a more complex pathway gives an equal or better fit.
Regularization, e.g. via the Bayesian information criterion: BIC = −2 log P(D | θ_ML) + K log N, where θ_ML are the maximum-likelihood parameters, K is the number of parameters, and N is the number of data points. The first term measures the data misfit; the second is the regularization term.
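A toy calculation (the residual sums of squares are made-up numbers, not values from the talk) showing how the BIC complexity penalty can overrule a marginally better fit:

```python
import math

def bic(rss, n_data, n_params):
    """BIC for a Gaussian model: data-misfit term plus complexity penalty."""
    return n_data * math.log(rss / n_data) + n_params * math.log(n_data)

# Two candidate pathways fitted to the same N = 50 data points:
# the denser one fits slightly better but pays a larger penalty.
n = 50
bic_sparse = bic(rss=10.0, n_data=n, n_params=2)  # simpler pathway
bic_dense  = bic(rss=9.8,  n_data=n, n_params=3)  # one extra edge

# Lower BIC wins: the tiny improvement in fit does not justify
# the extra parameter, so the sparser pathway is selected.
print(bic_sparse < bic_dense)  # True
```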
[Figure: likelihood and BIC as a function of model complexity]
Model selection: find the best pathway. Select the model M with the highest posterior probability, P(M | D) ∝ P(D | M) P(M). This requires an integration over the whole parameter space: P(D | M) = ∫ P(D | θ, M) P(θ | M) dθ.
Problem: huge computational costs of the integration over θ.
Accuracy Computational complexity Methods based on correlation and mutual information Conditional independence graphs Mechanistic models Bayesian networks
Bayesian networks: a marriage between graph theory and probability theory. Friedman et al. (2000), J. Comp. Biol. 7.
Bayesian network versus ODE model
Model with parameters θ — for Bayesian networks, the integral over θ is analytically tractable!
UAI 1994
Linearity assumption: [A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise. [Figure: target A regulated by P1, P2, P3, P4 with weights w1, w2, w3, w4]
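A hedged sketch of the linearity assumption: simulate [A] from four hypothetical regulators and recover the weights by ordinary least squares. This is plain regression via the normal equations, not the Bayesian scoring used later in the talk:

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

random.seed(1)
w_true = [0.5, -1.0, 2.0, 0.0]   # regulator weights (w4 = 0: no effect)
P = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
A_expr = [sum(w * p for w, p in zip(w_true, row)) + random.gauss(0, 0.05)
          for row in P]

# Least squares via the normal equations P^T P w = P^T A.
PtP = [[sum(row[i] * row[j] for row in P) for j in range(4)] for i in range(4)]
PtA = [sum(row[i] * a for row, a in zip(P, A_expr)) for i in range(4)]
w_hat = solve(PtP, PtA)
print([round(w, 2) for w in w_hat])  # close to the true weights
```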
Homogeneity assumption. [Data matrix: genes X(1), …, X(4) at time points t1, …, t10; entries X_{i,t}]
Accuracy vs. computational complexity: methods based on correlation and mutual information; conditional independence graphs; mechanistic models; Bayesian networks
Example: 4 genes, 10 time points. [Data matrix: genes X(1), …, X(4) at time points t1, …, t10; entries X_{i,t}]
Standard dynamic Bayesian network: homogeneous model. [Data matrix: genes X(1), …, X(4) at time points t1, …, t10; entries X_{i,t}]
Limitations of the homogeneity assumption
Our new model: heterogeneous dynamic Bayesian network. Here: 2 components. [Data matrix: genes X(1), …, X(4) at time points t1, …, t10; entries X_{i,t}]
Our new model: heterogeneous dynamic Bayesian network. Here: 3 components. [Data matrix: genes X(1), …, X(4) at time points t1, …, t10; entries X_{i,t}]
Learning with MCMC: sample the number of components (here: 3) and the allocation vector assigning each time point to a component.
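The flavour of MCMC over an allocation can be caricatured with a toy Metropolis-Hastings sampler for a single changepoint in a piecewise-constant mean — a drastic simplification of the actual sampler, with all names and numbers invented for illustration:

```python
import math
import random

random.seed(2)
# Time series whose mean jumps from 0 to 5 after t = 30.
true_cp = 30
data = [random.gauss(0 if t < true_cp else 5, 1) for t in range(60)]

def log_lik(cp):
    """Gaussian log-likelihood (unit variance) with the segment means
    estimated from the data on either side of changepoint cp."""
    total = 0.0
    for seg in (data[:cp], data[cp:]):
        m = sum(seg) / len(seg)
        total += -0.5 * sum((x - m) ** 2 for x in seg)
    return total

# Metropolis-Hastings random walk over the changepoint position.
cp = 5                            # deliberately bad starting value
counts = {}
for _ in range(3000):
    prop = cp + random.choice([-1, 1])
    if 2 <= prop <= len(data) - 2:   # keep both segments non-trivial
        if math.log(random.random()) < log_lik(prop) - log_lik(cp):
            cp = prop
    counts[cp] = counts.get(cp, 0) + 1

posterior_mode = max(counts, key=counts.get)
print(posterior_mode)             # concentrates near the true changepoint
```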
Non-homogeneous model Non-linear model
BGe: linear model, [A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise. [Figure: target A regulated by P1, P2, P3, P4 with weights w1, w2, w3, w4]
Can we get an approximate nonlinear model without data discretization?
Idea: piecewise linear model
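The piecewise-linear idea in miniature: scan candidate split points and fit a separate least-squares line to each side. This exhaustive scan is illustrative only; the model in the talk infers changepoints probabilistically:

```python
def fit_line(xs, ys):
    """Least-squares line fit; returns the residual sum of squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Deterministic "tent" response: linear up, then linear down.
xs = [i * 0.5 for i in range(20)]
ys = [x if x < 5 else 10 - x for x in xs]

# A single global line fits poorly; two lines joined at the right
# split point give a (near-)perfect piecewise linear approximation.
best_split, best_rss = None, float("inf")
for s in range(2, len(xs) - 2):
    rss = fit_line(xs[:s], ys[:s]) + fit_line(xs[s:], ys[s:])
    if rss < best_rss:
        best_split, best_rss = s, rss

print(xs[best_split], best_rss)  # split near x = 5, near-zero residual
```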
Inhomogeneous dynamic Bayesian network with common changepoints. [Data matrix: genes X(1), …, X(4) at time points t1, …, t10; entries X_{i,t}]
Inhomogeneous dynamic Bayesian network with node-specific changepoints. [Data matrix: genes X(1), …, X(4) at time points t1, …, t10; entries X_{i,t}]
NIPS 2009
Non-stationarity in the regulatory process
Non-stationarity in the network structure
Flexible network structure.
Flexible network structure with regularization
ICML 2010
Morphogenesis in Drosophila melanogaster. Gene expression measurements of 4028 genes over 66 time steps (Arbeitman et al., Science, 2002); selection of 11 genes involved in muscle development, as in Zhao et al. (2006), Bioinformatics 22.
Transition probabilities: flexible structure with regularization. Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult.
Overview: Introduction; Application to synthetic biology; Lessons from DREAM
Can we learn the switch from galactose to glucose? Can we learn the network structure?
NIPS 2010
Hierarchical Bayesian model across nodes 1, …, i, …, p
Exponential versus binomial prior distribution; exploration of various information-sharing options
Task 1: Changepoint detection. Switch of the carbon source: galactose → glucose.
Task 2: Network reconstruction. Precision: proportion of identified interactions that are correct. Recall: proportion of true interactions successfully recovered.
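These two definitions translate directly into code; the edge sets below are a hypothetical toy network, not the synthetic-biology circuit from the talk:

```python
def precision_recall(predicted, true):
    """Precision and recall of a set of predicted edges against the
    true network, with edges given as (regulator, target) pairs."""
    tp = len(predicted & true)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall

true_edges = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
pred_edges = {("A", "B"), ("B", "D"), ("D", "A")}  # 2 correct, 1 wrong

p, r = precision_recall(pred_edges, true_edges)
print(p, r)  # 2/3 of predictions correct; 2/4 true edges recovered
```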
BANJO: conventional homogeneous DBN. TSNI: method based on differential equations. Inference: optimization, yielding a single “best” network.
Sample of high-scoring networks
Feature extraction, e.g. marginal posterior probabilities of the edges
Galactose
Glucose
Average performance (AUC) over both phases, galactose and glucose:
Prior        | Coupling | Average AUC
None         |          | 0.70
Exponential  | Hard     | 0.77
Binomial     | Hard     | 0.75
Binomial     | Soft     | 0.75
How are we getting from here …
… to there ?!
Overview: Introduction; Application to synthetic biology; Lessons from DREAM
DREAM: Dialogue for Reverse Engineering Assessments and Methods. International network reconstruction competition, June–Sept 2010. [Table: for Networks 1–4 (Network 1 in silico), the numbers of transcription factors, genes, and chips]
Our team: Marco Grzegorczyk, University of Dortmund, Germany; Frank Dondelinger, BioSS / University of Edinburgh, United Kingdom; Sophie Lèbre, Université de Strasbourg, France; Andrej Aderhold, BioSS / University of St Andrews, United Kingdom.
Our model: developed for time series. Data: different experimental conditions, perturbations (e.g. ligand injection), interventions (e.g. gene knock-out, overexpression), time points. How do we get an ordering of the chips?
PCA
SOM
No time series: use a 1-dimensional SOM (self-organizing map) to get a chip order.
Ordering of chips → changepoint model
Problems with MCMC convergence. [Table: numbers of transcription factors, genes, and chips for Networks 1–4]
Problems with MCMC convergence. [Table: numbers of transcription factors, genes, and chips for Networks 1–4] PNAS 2009.
Linear model: [A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise. [Figure: target A regulated by P1, P2, P3, P4 with weights w1, w2, w3, w4]
L1 regularized linear regression
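A minimal illustration of L1-regularized regression via cyclic coordinate descent with soft-thresholding (toy deterministic data; not the solver used in the competition entry):

```python
def soft_threshold(z, lam):
    """The soft-thresholding operator behind L1-regularized regression."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent (toy, unstandardized version)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding predictor j.
            r = [yi - sum(w[k] * row[k] for k in range(p) if k != j)
                 for yi, row in zip(y, X)]
            zj = sum(row[j] * ri for row, ri in zip(X, r))
            w[j] = soft_threshold(zj, lam) / sum(row[j] ** 2 for row in X)
    return w

# Strong first predictor, weak second predictor: y = 2*x1 + 0.1*x2.
X = [[1, 1], [2, -1], [3, 1], [4, -1]]
y = [2.1, 3.9, 6.1, 7.9]
w = lasso_cd(X, y, lam=1.0)
print(w)  # the weak predictor's weight is shrunk exactly to zero
```

The L1 penalty yields exact zeros, which is what makes it attractive for recovering sparse regulatory networks.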
Problems with MCMC convergence Network# Transcription Factors # Genes# Chips Network 1 (in silico) Network Network Network
Problems with MCMC convergence Network# Transcription Factors # Genes# Chips Network 1 (in silico) Network Network Network
Assessment. Participants had to submit rankings of all interactions. Organisers computed areas under 1) precision-recall curves and 2) ROC curves (plotting sensitivity = recall against 1 − specificity).
Uncertainty about the best network structure: limited number of experimental replications, high noise.
Sample of high-scoring networks
Feature extraction, e.g. marginal posterior probabilities of the edges
Sample of high-scoring networks; feature extraction, e.g. marginal posterior probabilities of the edges: high-confidence edges, high-confidence non-edges, and uncertainty about the remaining edges.
ROC curves: true positive rate (sensitivity) plotted against false positive rate (complementary specificity, 1 − specificity).
Definition of metrics: recall (true positive rate) = TP / (total number of true edges); precision = TP / (total number of predicted edges); false positive rate = FP / (total number of non-edges).
The relation between Precision-Recall (PR) and ROC curves
[Figure annotation: arrow indicating the direction of better performance]
Assessment. Participants had to submit rankings of all interactions. Organisers computed areas under 1) precision-recall curves and 2) ROC curves (plotting sensitivity = recall against 1 − specificity).
[ROC plot: proportion of recovered true edges against proportion of avoided non-edges; the diagonal, AUROC = 0.5, corresponds to a random ranking]
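AUROC can be computed directly as a rank statistic: the probability that a randomly chosen true edge is scored above a randomly chosen non-edge. The scores below are invented for illustration:

```python
def auroc(scores_pos, scores_neg):
    """AUROC as the probability that a random positive outranks a
    random negative (ties count one half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Edge confidence scores for true edges vs. non-edges.
true_edge_scores = [0.9, 0.8, 0.7, 0.4]
non_edge_scores = [0.6, 0.3, 0.2, 0.1]

print(auroc(true_edge_scores, non_edge_scores))  # well above 0.5
print(auroc(non_edge_scores, non_edge_scores))   # 0.5: uninformative ranking
```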
Joint work with Wolfgang Lehrach on ab initio prediction of protein interactions. AUROC = 0.61, 0.67, 0.67.
ICML 2006
The relation between Precision-Recall (PR) and ROC curves. [Figure annotation: arrow indicating the direction of better performance]
Potential advantage of Precision-Recall (PR) over ROC curves: with a large number of negative examples (TN + FP), a large change in FP may have only a small effect on the false positive rate (small difference), but a strong effect on the precision (large difference).
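The arithmetic behind this point, with assumed toy counts for a sparse network (few true edges, very many non-edges):

```python
# Assumed counts for illustration only.
n_neg = 10_000                    # TN + FP: number of non-edges
tp = 90                           # true edges recovered by both methods

# Method A makes 10 false positives; method B makes ten times as many.
fpr_a, prec_a = 10 / n_neg, tp / (tp + 10)
fpr_b, prec_b = 100 / n_neg, tp / (tp + 100)

# The false positive rate barely moves, but precision collapses.
print(f"FPR:       {fpr_a:.3f} -> {fpr_b:.3f}")    # 0.001 -> 0.010
print(f"precision: {prec_a:.2f} -> {prec_b:.2f}")  # 0.90  -> 0.47
```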
Room for improvement: a higher-dimensional changepoint process to accommodate perturbations and experimental conditions.