Mechanistic models and machine learning methods for TIMET

1 Mechanistic models and machine learning methods for TIMET
Dirk Husmeier

2 Protein signalling pathway
[Diagram: protein signalling pathway. Receptor molecules at the cell membrane; nodes are phosphorylated proteins; edges show activation and inhibition interactions in the signalling pathway. From Sachs et al., Science, 2005.]

3 Can we learn the signalling pathway from data?
[Diagram: the same signalling pathway: receptor molecules, cell membrane, phosphorylated proteins, activation and inhibition edges. From Sachs et al., Science, 2005.]

4 High-throughput experiments
[Diagram: the network is unknown; high-throughput experiments yield postgenomic data, from which machine learning and statistics are used to infer the network.]

5 Methodology
Workpackages: mechanistic models and machine learning methods.
- WP 1.7: Re-calibrate the circadian clock model for mature plants growing without exogenous sugars.
- WP 2.4: Bi-directional regulation: mechanistic modelling of each metabolic pathway, with connections to the clock.
- WP 2.5: Bi-directional regulation: testing predictions of the bidirectional models.

6 Methodology
- Mechanistic models
- Bayesian networks
- Integration of biological prior knowledge
- Non-homogeneous Bayesian network for non-stationary processes

7 Regulatory network

8 Elementary molecular biological processes

9 Description with differential equations

10 Description with differential equations

11 Concentrations, rates, and kinetic parameters θ

12 Description with differential equations
The rates of change of the concentrations x are described by differential equations with kinetic parameters θ: dx/dt = f(x, θ).

13 Parameters θ known: numerically integrate the differential equations for different hypothetical networks
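A minimal sketch of what this step could look like (illustrative only, not the TIMET code): a hypothetical two-gene network written as an ODE system and integrated numerically. The regulation terms, rate constants and time grid below are all invented for the example.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical 2-gene network: gene 1 activates gene 2, gene 2 represses gene 1.
# theta = (synthesis rates, Michaelis constants, degradation rates) -- illustrative only.
def make_rhs(theta):
    s1, s2, k1, k2, d1, d2 = theta
    def rhs(t, x):
        x1, x2 = x
        dx1 = s1 * k1 / (k1 + x2) - d1 * x1   # repression of gene 1 by gene 2
        dx2 = s2 * x1 / (k2 + x1) - d2 * x2   # activation of gene 2 by gene 1
        return [dx1, dx2]
    return rhs

theta = (1.0, 1.2, 0.5, 0.5, 0.3, 0.4)        # assumed known kinetic parameters
t_obs = np.linspace(0, 24, 13)                # e.g. 2-hourly time points
sol = solve_ivp(make_rhs(theta), (0, 24), y0=[0.1, 0.1], t_eval=t_obs)
predicted = sol.y                             # predicted concentration time series
```

A different hypothetical network is simply a different right-hand-side function, so each candidate structure yields its own predicted time series.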

14 Experiment: Gene expression time series
Can we infer the correct gene regulatory network?

15 Model selection for known parameters θ
Compare the gene expression time series predicted with the different models against the measured gene expression time series. Highest likelihood: best model.

16 Model selection for unknown parameters θ
Compare the gene expression time series predicted with the different models against the measured gene expression time series. Joint maximum likelihood: maximise P(D | θ, M) over the kinetic parameters θ for each candidate model M.
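A minimal sketch of this step, reusing the hypothetical two-gene ODE model above and assuming independent Gaussian observation noise with known standard deviation (both assumptions of mine, not stated in the talk):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.integrate import solve_ivp

def neg_log_likelihood(theta, rhs_factory, t_obs, data, sigma=0.1):
    """Negative Gaussian log-likelihood of the measured series under one candidate network."""
    sol = solve_ivp(rhs_factory(theta), (t_obs[0], t_obs[-1]),
                    y0=data[:, 0], t_eval=t_obs)
    if not sol.success or sol.y.shape != data.shape:
        return 1e10                              # penalise failed integrations
    resid = data - sol.y
    return 0.5 * np.sum(resid ** 2) / sigma ** 2

def fit_model(rhs_factory, theta0, t_obs, data):
    """Maximise the likelihood over the kinetic parameters for one network."""
    res = minimize(neg_log_likelihood, theta0,
                   args=(rhs_factory, t_obs, data), method='Nelder-Mead')
    return res.x, -res.fun   # ML parameters and maximised log-likelihood (up to a constant)

# Candidate networks (different rhs factories) can then be ranked by their
# maximised likelihoods -- which, as the next slides note, favours complex models.
```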

17 1) Practical problem: numerical optimization of the likelihood over the kinetic parameters θ.
2) Conceptual problem: overfitting: the maximum likelihood estimate increases as the network complexity increases.

18 Regularization, e.g. BIC
BIC = -2 log P(D | θ_ML, M) + K log N
The first term is the data misfit term, evaluated at the maximum likelihood parameters θ_ML; the second is the regularization term, where K is the number of parameters and N the number of data points.
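As a small worked sketch of this score (not code from the talk), assuming the maximised log-likelihood has already been computed for each candidate network:

```python
import numpy as np

def bic(max_log_likelihood, n_params, n_data):
    """BIC = -2 * maximised log-likelihood + n_params * log(n_data); lower is better."""
    return -2.0 * max_log_likelihood + n_params * np.log(n_data)

# Example: a 6-parameter network fitted to 2 genes x 13 time points = 26 observations
# score = bic(loglik_hat, n_params=6, n_data=26)
```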

19 Model selection: find the best pathway
Select the model with the highest posterior probability P(M | D) ∝ P(D | M) P(M). This requires an integration over the whole parameter space: P(D | M) = ∫ P(D | θ, M) P(θ | M) dθ.

20 Model selection: find the best pathway
Select the model with the highest posterior probability; this requires an integration over the whole parameter space, P(D | M) = ∫ P(D | θ, M) P(θ | M) dθ. This integral is usually analytically intractable.

21 Complexity problem: the integration over the whole parameter space of θ has to be approximated numerically, which is highly non-trivial.

22

23 Illustration of annealed importance sampling
[Figure: annealed importance sampling moves gradually from the prior distribution to the posterior distribution. Taken from the MSc thesis by Ben Calderhead.]

24 Outer loop: annealing scheme. Centre loop: MCMC. Inner loop: numerical solution of the differential equations.
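The three nested loops can be sketched as follows. This is a generic annealed importance sampling skeleton under my own naming (log_lik, sample_prior and mcmc_step are placeholders); evaluating log_lik is where the numerical ODE solution of the inner loop happens.

```python
import numpy as np

def annealed_importance_sampling(log_lik, sample_prior, mcmc_step, betas, n_runs=10):
    """Estimate the log marginal likelihood log P(D|M) by AIS.

    log_lik(theta): log-likelihood, requires solving the ODEs (inner loop).
    sample_prior(): draw theta from the prior.
    mcmc_step(theta, beta): one MCMC move targeting prior * likelihood**beta (centre loop).
    betas: increasing annealing schedule from 0 to 1 (outer loop).
    """
    log_weights = []
    for _ in range(n_runs):
        theta = sample_prior()
        log_w = 0.0
        for beta_prev, beta in zip(betas[:-1], betas[1:]):
            log_w += (beta - beta_prev) * log_lik(theta)  # importance-weight increment
            theta = mcmc_step(theta, beta)                # centre loop: MCMC at this temperature
        log_weights.append(log_w)
    # log-mean-exp of the weights gives the log marginal likelihood estimate
    m = np.max(log_weights)
    return m + np.log(np.mean(np.exp(np.array(log_weights) - m)))
```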

25 Computationally expensive: ab initio network reconstruction is infeasible
Marginal likelihoods for the alternative pathways. Computationally expensive: ab initio network reconstruction is infeasible.

26 Outer loop: annealing scheme. Centre loop: MCMC. Inner loop: numerical solution of the differential equations.

27 NIPS 2008

28 Objective: Reconstruction of regulatory networks ab initio
Higher level of abstraction: Bayesian networks

29 Machine learning methods
- Bayesian networks (overview)
- Integration of biological prior knowledge
- Non-homogeneous Bayesian networks for non-stationary processes
- Circadian gene regulatory network in Arabidopsis thaliana
- Current work

30 Marriage between graph theory and probability theory
Friedman et al. (2000), J. Comp. Biol. 7.

31 Bayes net vs. ODE model

32 Bayesian networks Marriage between graph theory and probability theory. Directed acyclic graph (DAG) representing conditional independence relations. It is possible to score a network in light of the data: P(D|M), with D the data and M the network structure. We can infer how well a particular network explains the observed data. [Diagram: example DAG with nodes A, B, C, D, E, F.]

33 [A]= w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise
Linear model: [A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise. [Diagram: node A with parents P1, P2, P3, P4 and edge weights w1, w2, w3, w4.]
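A minimal sketch of how the weights of such a linear model can be estimated by least squares, on simulated data (all numbers invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
parents = rng.normal(size=(n, 4))                   # simulated levels of P1..P4
w_true = np.array([0.8, -0.5, 0.0, 1.2])
A = parents @ w_true + 0.1 * rng.normal(size=n)     # [A] = w1[P1] + ... + w4[P4] + noise

w_hat, *_ = np.linalg.lstsq(parents, A, rcond=None) # least-squares estimate of the weights
```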

34 Nonlinear discretized model
[Diagram: discretised regulation of gene P by an activator P1 (activation) and a repressor P2 (inhibition). Noise is allowed for by modelling the regulation probabilistically, with a conditional multinomial distribution.]

35 Integral analytically tractable!
For this model, with parameters θ, the integral P(D | M) = ∫ P(D | θ, M) P(θ | M) dθ is analytically tractable.
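For intuition, a sketch of the kind of closed-form score this tractability gives: the local marginal likelihood of one node under multinomial conditionals with a Dirichlet prior. The BDeu-style prior with equivalent sample size alpha is my assumption, not necessarily the prior used in the talk.

```python
import numpy as np
from scipy.special import gammaln

def local_log_marginal(counts, alpha=1.0):
    """Closed-form log marginal likelihood for one node with multinomial
    conditionals and a Dirichlet prior.

    counts[j, k] = number of cases in which the parents are in configuration j
    and the node is in state k; alpha is the (assumed) equivalent sample size.
    """
    counts = np.asarray(counts, dtype=float)
    q, r = counts.shape                  # parent configurations, node states
    a_jk = alpha / (q * r)               # BDeu-style pseudo-counts
    a_j = alpha / q
    n_j = counts.sum(axis=1)
    score = np.sum(gammaln(a_j) - gammaln(a_j + n_j))
    score += np.sum(gammaln(a_jk + counts) - gammaln(a_jk))
    return score
```

Summing such local scores over all nodes gives the log of P(D|M) without any numerical integration.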

36

37 Example: 2 genes → 16 different network structures
Best network: maximum score

38 Identify the best network structure
Ideal scenario: Large data sets, low noise

39 Uncertainty about the best network structure
Limited number of experimental replications, high noise

40 Sample of high-scoring networks

41 Sample of high-scoring networks
Feature extraction, e.g. marginal posterior probabilities of the edges

42 Sample of high-scoring networks
Feature extraction, e.g. marginal posterior probabilities of the edges. Edges with posterior probability near 1 are high-confidence edges, those near 0 are high-confidence non-edges, and intermediate values indicate uncertainty about the edges.
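A minimal sketch of this feature extraction step, assuming the sample of high-scoring networks is stored as a stack of 0/1 adjacency matrices:

```python
import numpy as np

def edge_posteriors(sampled_adjacencies):
    """Marginal posterior probability of each edge, estimated as the
    fraction of sampled networks in which the edge appears.

    sampled_adjacencies: array of shape (n_samples, n_genes, n_genes) with 0/1 entries.
    """
    return np.mean(sampled_adjacencies, axis=0)

# Edges with posterior close to 1 are high-confidence edges, close to 0
# high-confidence non-edges, and intermediate values reflect uncertainty.
```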

43 Can we generalize this scheme to more than 2 genes?
In principle yes. However …

44 [Plot: the number of network structures grows super-exponentially with the number of nodes.]

45 Sampling from the posterior distribution
Find the high-scoring structures in the configuration space of network structures.

46 MCMC
Propose a local change to the current network structure. If the acceptance ratio R (posterior ratio times proposal correction) satisfies R ≥ 1, accept the move; if R < 1, accept it with probability R. [Diagram: configuration space of network structures.]
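A sketch of one such Metropolis-Hastings step in structure space; the scoring function and the local-move proposal (e.g. add, delete or reverse a single edge, together with its Hastings correction) are passed in as placeholders of my own naming:

```python
import numpy as np

def mh_structure_step(adj, log_score, propose_local_change, rng):
    """One Metropolis-Hastings step over network structures.

    adj: current adjacency matrix; log_score(adj): log P(D|M) + log P(M);
    propose_local_change(adj, rng): returns (new_adj, log_hastings_ratio),
    where the Hastings ratio corrects for asymmetric proposal probabilities.
    """
    new_adj, log_hastings = propose_local_change(adj, rng)
    log_r = log_score(new_adj) - log_score(adj) + log_hastings
    if np.log(rng.uniform()) < min(0.0, log_r):  # accept if R >= 1, else with probability R
        return new_adj
    return adj
```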

47 Madigan & York (1995), Giudici & Castelo (2003)

48 Problem: Local changes → small steps → slow convergence, difficult to cross valleys.
Configuration space of network structures

49 Problem: Global changes → large steps → low acceptance → slow convergence.
Configuration space of network structures

50 Can we make global changes that jump onto other peaks and are likely to be accepted?
Configuration space of network structures

51

52 Machine learning methods
- Bayesian networks (overview)
- Integration of biological prior knowledge
- Non-homogeneous Bayesian network for non-stationary processes
- Circadian gene network in Arabidopsis thaliana
- Current work

53 Bayesian inference
Select the model based on the posterior probability P(M | D) ∝ P(D | M) P(M). This requires an integration over the whole parameter space: P(D | M) = ∫ P(D | θ, M) P(θ | M) dθ.

54 Uncertainty about the best network structure
Limited number of experimental replications, high noise

55 Reduced uncertainty by using prior knowledge
[Diagram: the data are combined with prior knowledge.]

56 Bayesian analysis: integration of prior knowledge
The hyperparameter β trades off the data (microarray data) against the prior knowledge (KEGG pathway).

57 Hyperparameter β trades off data versus prior knowledge
β small: more weight on the microarray data, less on the KEGG pathway prior.

58 Hyperparameter β trades off data versus prior knowledge
β large: more weight on the KEGG pathway prior, less on the microarray data.

59 Input: microarray data and prior knowledge. Learn: the network structure and the hyperparameter β with MCMC.
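One common way to encode this trade-off, in the spirit of energy-based structure priors over a KEGG-derived prior matrix (a sketch under that assumption, not necessarily the exact form used here):

```python
import numpy as np

def prior_energy(adj, prior_matrix):
    """Energy E(M): total disagreement between the network M (0/1 adjacency)
    and the prior-knowledge matrix B (entries in [0, 1], e.g. derived from KEGG)."""
    return np.sum(np.abs(prior_matrix - adj))

def log_structure_prior(adj, prior_matrix, beta):
    """Unnormalised log prior: log P(M | beta) = -beta * E(M) + const.
    A larger beta pulls the inferred network towards the prior knowledge."""
    return -beta * prior_energy(adj, prior_matrix)

# In the MCMC scheme, beta can be sampled alongside the structure, so the data
# themselves determine how much weight the prior knowledge receives.
```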

60

61 Raf signalling pathway
[Diagram: the Raf signalling pathway: receptor molecules at the cell membrane, phosphorylated proteins, activation and inhibition interactions. From Sachs et al., Science, 2005.]

62 Flow cytometry data
- Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins
- 5400 cells measured under 9 different cellular conditions (cues)
- Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments

63 Prior knowledge from KEGG
[Diagram: prior-knowledge matrix derived from KEGG, with edge confidences such as 0.87, 0.71, 1, 0.5 and 0.25.] Data: protein concentrations from flow cytometry experiments.

64 Protein signalling network from the literature

65 Predicted network: 11 nodes, 20 edges, 90 non-edges
20 top-scoring edges: 15/20 correct (75%); 5/90 false (94% of the non-edges correctly excluded).

66 Machine learning methods
- Bayesian networks (overview)
- Integration of biological prior knowledge
- Non-homogeneous Bayesian network for non-stationary processes
- Circadian gene regulatory network in Arabidopsis thaliana
- Current work

67

68 Dynamic Bayesian network

69 Example: 4 genes, 10 time points

70 Standard dynamic Bayesian network: homogeneous model
[Diagram: four gene time series X(1), X(2), X(3), X(4) observed at 10 time points; in the homogeneous model the same network and parameters apply to all transitions.]

71 Our new model: heterogeneous dynamic Bayesian network
Our new model: heterogeneous dynamic Bayesian network. Here: 2 components. [Diagram: the 10 time points are divided into two segments (components), each with its own parameters.]

72 Our new model: heterogeneous dynamic Bayesian network
Our new model: heterogeneous dynamic Bayesian network. Here: 3 components. [Diagram: the 10 time points are divided into three segments (components), each with its own parameters.]

73 Learning with MCMC
Sample the parameters θ, the allocation vector (which assigns each time point to one of the components) and the number of components (here: 3).
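A sketch of how the allocation vector enters the score: conditional on the allocation, the score of a child node decomposes into a sum of per-segment scores, each computed from the time points assigned to that component. Here segment_score is a placeholder for whatever closed-form local score is used, and the data layout is my own assumption.

```python
import numpy as np

def heterogeneous_log_score(data, parents, child, allocation, segment_score):
    """Log score of one child node in a heterogeneous dynamic Bayesian network.

    data: array (n_genes, n_timepoints); parents: list of parent gene indices;
    allocation: array of length n_timepoints - 1 assigning each transition
    t -> t+1 to a component; segment_score(X, y): closed-form local score
    of the child given its parents for one segment (placeholder).
    """
    total = 0.0
    for k in np.unique(allocation):
        idx = np.where(allocation == k)[0]     # transitions allocated to component k
        X = data[parents, :][:, idx].T         # parent values at time t
        y = data[child, idx + 1]               # child values at time t+1
        total += segment_score(X, y)           # parameters differ between components
    return total
```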

74

75 Morphogenesis in Drosophila melanogaster
Gene expression measurements of 4028 genes over 66 time steps (Arbeitman et al., Science, 2002). Selection of 11 genes involved in muscle development: Zhao et al. (2006), Bioinformatics 22.

76 Heterogeneous dynamic Bayesian network: Plausible segmentation?
[Diagram: the four-gene time series over 10 time points again; which segmentation into components is plausible?]

77 Number of components

78 Number of components Four stages of the Drosophila life cycle: embryo → larva → pupa → adult

79 time

80 Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult. The gene expression program governing the transition to adult morphology is active well before the fly emerges from the pupa.

81 Machine learning methods
- Bayesian networks (overview)
- Integration of biological prior knowledge
- Non-homogeneous Bayesian network for non-stationary processes
- Circadian gene regulatory network in Arabidopsis thaliana
- Current work

82 Circadian rhythms in Arabidopsis thaliana
Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University (Andrew Millar's group).
- 2 time series, T20 and T28, of microarray gene expression data from Arabidopsis thaliana
- Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3
- Both time series measured under constant light conditions at 13 time points: 0h, 2h, …, 24h, 26h
- Plants entrained with different light:dark cycles: 10h:10h (T20) and 14h:14h (T28)

83 Gene expression time series plots (Arabidopsis data T20 and T28)

84 Predicted network
Edge colours: blue = activation, red = inhibition, black = mixture. Three different line widths: thin = PP > 0.5, medium = PP > 0.75, fat = PP > 0.9.

85 Cogs of the Plant Clockwork
Review: Rob McClung, Plant Cell 2006. Two major gene classes: morning genes (e.g. LHY, CCA1) repress evening genes (e.g. TOC1, ELF3, ELF4, GI, LUX), which in turn activate LHY and CCA1.

86 Literature vs. inferred network
[Diagram: literature network versus inferred network over ELF3, CCA1, LHY, PRR9, GI, TOC1, PRR5, PRR3 and ELF4, with false negatives and false positives marked.] We expect direct inhibition of several evening genes by LHY/CCA1, here drawn together for clarity. Interestingly, some of these links were learned, but all arose from CCA1, not LHY; instead, the BGM highlights a sequence of positive or mixed links: LHY-CCA1-PRR9-GI-TOC1/ELF3/PRR5-ELF4.

87 True positives (TP) = 8, false positives (FP) = 13, false negatives (FN) = 5, true negatives (TN) = 9² - 8 - 13 - 5 = 55. Sensitivity = TP/(TP+FN) = 62%; specificity = TN/(TN+FP) = 81%.

88 Overview of the plant clock model
[Diagram: plant clock model of Locke et al., Mol. Syst. Biol. 2006, with morning components (LHY/CCA1, PRR9/PRR7), evening components (TOC1, Y (GI)), ZTL, and the unknown component X.] The unknown component X allows a delay of more than 8h between TOC1 and LHY/CCA1 expression. This is the model as published; note that some genes are merged for parsimony. There is data for each link except TOC1 inhibition of GI, and X is an unknown TOC1-dependent component that activates LHY and CCA1. Given the long delay, it is unlikely the model would learn the X link, but there is a BGM link from ELF3 to CCA1 that is exactly as expected for X.

89 Literature vs. inferred network
[Diagram: literature network versus inferred network over ELF3, CCA1, LHY, PRR9, GI, TOC1, PRR5, PRR3 and ELF4, with false negatives and false positives marked.] We expect direct inhibition of several evening genes by LHY/CCA1, here drawn together for clarity. Interestingly, some of these links were learned, but all arose from CCA1, not LHY; instead, the BGM highlights a sequence of positive or mixed links: LHY-CCA1-PRR9-GI-TOC1/ELF3/PRR5-ELF4.

90 Machine learning methods
- Bayesian networks (overview)
- Integration of biological prior knowledge
- Non-homogeneous Bayesian network for non-stationary processes
- Circadian gene regulatory network in Arabidopsis thaliana
- Current work

91 Flexible network structure with regularization Joint work with Sophie Lèbre and Frank Dondelinger

92 Drosophila melanogaster: Expression of 11 muscle development genes over 66 time points
Fixed structure, flexible parameters. Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult. The gene expression program governing the transition to adult morphology is active well before the fly emerges from the pupa.

93 Transition probabilities: flexible structure with regularization
Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult.

94 Comparison: Ahmed & Xing versus Dondelinger, Lèbre & Husmeier

95 Summary
- Mechanistic models
- Bayesian networks
- Integration of biological prior knowledge
- Non-homogeneous Bayesian network for non-stationary processes

96 Any questions? Thank you!

