Identifying co-regulation using Probabilistic Relational Models, by Christoforos Anagnostopoulos (BA Mathematics, Cambridge University; MSc Informatics, Edinburgh University)

1 Identifying co-regulation using Probabilistic Relational Models by Christoforos Anagnostopoulos BA Mathematics, Cambridge University MSc Informatics, Edinburgh University supervised by Dirk Husmeier

2 General Problematic Bringing together disparate data sources: Promoter sequence data...ACGTTAAGCCAT......GGCATGAATCCC...

3 General Problematic Bringing together disparate data sources: Promoter sequence data...ACGTTAAGCCAT......GGCATGAATCCC... mRNA Gene expression data gene 1: overexpressed gene 2: overexpressed...

4 General Problematic Bringing together disparate data sources: Promoter sequence data...ACGTTAAGCCAT......GGCATGAATCCC... mRNA Gene expression data gene 1: overexpressed gene 2: overexpressed... Proteins Protein interaction data:
protein 1   protein 2   ORF 1     ORF 2
AAC1        TIM10       YMR056C   YHR005CA
AAD6        YNL201C     YFL056C   YNL201C

5 Our data Promoter sequence data...ACGTTAAGCCAT......GGCATGAATCCC... mRNA Gene expression data gene 1: overexpressed gene 2: overexpressed...

6 Bayesian Modelling Framework Bayesian Networks

7 Bayesian Modelling Framework Bayesian Networks Conditional Independence Assumptions Factorisation of the Joint Probability Distribution UNIFIED TRAINING
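
As a minimal illustration of what factorising the joint distribution means, the sketch below assumes a toy chain of three binary variables (motif presence R, module M, expression E). The structure and every probability in it are invented for illustration and are not taken from the thesis.

```python
from itertools import product

# Hypothetical conditional probability tables for a chain R -> M -> E.
# All numbers are invented for illustration only.
P_R = {0: 0.7, 1: 0.3}
P_M_given_R = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
P_E_given_M = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}


def joint(r, m, e):
    """P(R=r, M=m, E=e) = P(r) * P(m | r) * P(e | m)."""
    return P_R[r] * P_M_given_R[r][m] * P_E_given_M[m][e]


def marginal_e(e):
    """Marginal P(E=e), obtained by summing the factorised joint over R and M."""
    return sum(joint(r, m, e) for r, m in product((0, 1), repeat=2))


# The factorised joint is a proper distribution: it sums to one.
total = sum(joint(r, m, e) for r, m, e in product((0, 1), repeat=3))
```

The conditional independence assumption here is that E depends on R only through M, which is why three small tables suffice instead of one table over all eight joint states.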

8 Bayesian Modelling Framework Bayesian Networks Probabilistic Relational Models

9 Aims for this presentation: 1. Briefly present the Segal model and the main criticisms offered in the thesis 2. Briefly introduce PRMs 3. Outline directions for future work

10 The Segal Model Cluster genes into transcriptional modules... Module 1 gene Module 2 ?

11 The Segal Model Module 1 gene Module 2 P(M = 1)P(M = 2)

12 The Segal Model How to determine P(M = 1)? Module 1 gene P(M = 1)

13 The Segal Model How to determine P(M = 1)? Module 1 Motif Profile motif 3: active motif 4: very active motif 16: very active motif 29: slightly active gene

14 The Segal Model How to determine P(M = 1)? Module 1 Motif Profile motif 3: active motif 4: very active motif 16: very active motif 29: slightly active Predicted Expression Levels Array 1: overexpressed Array 2: overexpressed Array 3: underexpressed... gene

15 The Segal Model How to determine P(M = 1)? Module 1 Motif Profile motif 3: active motif 4: very active motif 16: very active motif 29: slightly active Predicted Expression Levels Array 1: overexpressed Array 2: overexpressed Array 3: underexpressed... gene P(M = 1)

16 The Segal model PROMOTER SEQUENCE

17 The Segal model PROMOTER SEQUENCE MOTIF PRESENCE

18 The Segal model PROMOTER SEQUENCE MOTIF PRESENCE MOTIF MODEL

19 The Segal model MOTIF PRESENCE MODULE ASSIGNMENT

20 The Segal model MOTIF PRESENCE MODULE ASSIGNMENT REGULATION MODEL

21 The Segal model MODULE ASSIGNMENT EXPRESSION DATA

22 The Segal model MODULE ASSIGNMENT EXPRESSION DATA EXPRESSION MODEL

23 Learning via hard EM HIDDEN

24 Learning via hard EM Initialise hidden variables

25 Learning via hard EM Initialise hidden variables Set parameters to Maximum Likelihood

26 Learning via hard EM Initialise hidden variables Set parameters to Maximum Likelihood Set hidden values to their most probable value given the parameters (hard EM)
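
The hard EM loop on these slides can be sketched on a much simpler model than the Segal one. The example below assumes a one-dimensional mixture of equal-variance Gaussians, where the hidden variable is the cluster label; the function name, data and initial labels are hypothetical.

```python
def hard_em(points, assignments, n_clusters, max_iters=100):
    """Hard EM for a 1-D mixture of equal-variance Gaussians (toy stand-in)."""
    means = [0.0] * n_clusters
    for _ in range(max_iters):
        # Set parameters to maximum likelihood given the current hidden values.
        for k in range(n_clusters):
            members = [x for x, z in zip(points, assignments) if z == k]
            if members:  # keep the previous mean if a cluster is empty
                means[k] = sum(members) / len(members)
        # Set each hidden value to its most probable value given the parameters.
        # With equal variances this is simply the nearest mean.
        new_assignments = [
            min(range(n_clusters), key=lambda k: (x - means[k]) ** 2)
            for x in points
        ]
        if new_assignments == assignments:
            break  # converged: no hidden value changed
        assignments = new_assignments
    return means, assignments


points = [0.1, -0.2, 0.3, 4.9, 5.2, 5.0]
means, labels = hard_em(points, [0, 1, 0, 1, 0, 1], n_clusters=2)
```

Soft EM would replace the nearest-mean step with a posterior distribution over labels; the hard variant commits to a single value per hidden variable at each iteration.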

28 Motif Model OBJECTIVE: Learn motif so as to discriminate between genes for which the Regulation variable is “on” and genes for which it is “off”. r = 1 r = 0

29 Motif Model – scoring scheme high score: ...CATTCC... low score: ...TGACAA...

30 Motif Model – scoring scheme high score: ...CATTCC... low score: ...TGACAA... ...AGTCCATTCCGCCTCAAG... high scoring subsequences

31 Motif Model – scoring scheme high score: ...CATTCC... low score: ...TGACAA... ...AGTCCATTCCGCCTCAAG... high scoring subsequences, low scoring (background) subsequences

32 Motif Model – scoring scheme high score: ...CATTCC... low score: ...TGACAA... ...AGTCCATTCCGCCTCAAG... high scoring subsequences, low scoring (background) subsequences, promoter sequence scoring
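
One way to picture the scoring scheme is a position-specific scoring matrix (PSSM) slid along the promoter, with each window scored by its log-odds against a background model. The sketch below is an assumption-laden toy: the uniform background, the 0.85/0.05 column probabilities and the helper names are all hypothetical, and the thesis may combine window scores differently.

```python
import math

# Assumed uniform background distribution over nucleotides.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def column(preferred):
    """Hypothetical PSSM column: 0.85 on the preferred base, 0.05 on the others."""
    return {b: (0.85 if b == preferred else 0.05) for b in "ACGT"}


def log_odds(window, pssm):
    """Log-odds of a window under the motif PSSM versus the background."""
    return sum(math.log(col[base] / BACKGROUND[base])
               for base, col in zip(window, pssm))


def scan(sequence, pssm):
    """Score every window of motif width; return (best_score, best_offset)."""
    width = len(pssm)
    scores = [log_odds(sequence[i:i + width], pssm)
              for i in range(len(sequence) - width + 1)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


pssm = [column(b) for b in "CATTCC"]   # motif taken from the slide
score, offset = scan("AGTCCATTCCGCCTCAAG", pssm)
low = log_odds("TGACAA", pssm)         # the slide's low-scoring example
```

In this toy setting the window matching CATTCC gets a large positive score, while TGACAA mismatches at every position and scores below the background.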

33 Motif Model SCORING SCHEME P(g.r = true | g.S, w) w: parameter set, which can be taken to represent motifs

34 Motif Model SCORING SCHEME P(g.r = true | g.S, w) w: parameter set, which can be taken to represent motifs Maximum Likelihood setting: most discriminatory motif

35 Motif Model – overfitting TRUE PSSM

36 Motif Model – overfitting typical motif:...TTT.CATTCC... high score TRUE PSSM

37 Motif Model – overfitting typical motif:...TTT.CATTCC... high score TRUE PSSM INFERRED PSSM Can triple the score!

38 Regulation Model For each module m and each motif i, we estimate the association u_mi. P(g.M = m | g.R) is proportional to
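
The expression that follows "is proportional to" is not preserved in this transcript. A form consistent with the inner-product criterion on the next slide is a softmax over modules, P(g.M = m | g.R) ∝ exp(Σ_i u_mi · g.R_i); that form is an assumption here, as are the weights and the function name in the sketch below.

```python
import math


def module_posterior(u, r):
    """Assumed softmax form: P(M = m | R = r) proportional to exp(sum_i u[m][i] * r[i])."""
    activations = [sum(u_mi * r_i for u_mi, r_i in zip(row, r)) for row in u]
    shift = max(activations)  # subtract the maximum for numerical stability
    weights = [math.exp(a - shift) for a in activations]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical weights for 2 modules x 3 motifs; r marks which motifs are "on".
u = [[2.0, 0.0, -1.0],
     [0.0, 1.5, 0.5]]
probs = module_posterior(u, r=[1, 0, 1])
```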

39 Regulation Model: Geometrical Interpretation The (u_mi)_i define separating hyperplanes. Classification criterion is the inner product: Each datapoint is given the label of the hyperplane it is the furthest away from, on its positive side.

40 Regulation Model: Divergence and Overfitting Pairwise linear separability leads to overconfident classification. Method A: dampen the parameters (e.g. with a Gaussian prior). Method B: make the dataset linearly inseparable by augmentation.
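
The divergence can be reproduced on a toy problem. The sketch below assumes a one-dimensional logistic model on linearly separable data (not the thesis's regulation model) and compares plain maximum likelihood with Method A, where a Gaussian prior contributes an L2 penalty; the data, step size and penalty strength are hypothetical.

```python
import math


def fit_logistic(xs, ys, l2=0.0, steps=2000, lr=0.5):
    """Gradient ascent on the 1-D logistic log-likelihood minus (l2 / 2) * w**2."""
    w = 0.0
    for _ in range(steps):
        grad = sum((y - 1.0 / (1.0 + math.exp(-w * x))) * x
                   for x, y in zip(xs, ys))
        w += lr * (grad - l2 * w)  # the l2 term is the Gaussian-prior damping
    return w


# Linearly separable toy data: negatives left of 0, positives right of 0.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w_free_short = fit_logistic(xs, ys, steps=200)
w_free_long = fit_logistic(xs, ys, steps=2000)    # keeps growing without a prior
w_damped = fit_logistic(xs, ys, l2=0.1, steps=2000)  # settles at a finite value
```

Without the penalty the likelihood has no finite maximiser on separable data, so the weight grows for as long as training runs and the predicted probabilities drift towards 0 and 1.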

41 Erroneous interpretation of the parameters Segal et al. claim that: When u_mi = 0, motif i is inactive in module m. When u_mi > 0 for all i, m, then only the presence of motifs is significant, not their absence.

42 Erroneous interpretation of the parameters Segal et al. claim that: When u_mi = 0, motif i is inactive in module m. When u_mi > 0 for all i, m, then only the presence of motifs is significant, not their absence. These claims contradict the normalisation conditions!
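
One possible reading of the normalisation objection, assuming the module probabilities are a softmax of the inner products (the slide's formula is not in the transcript, and this may not be the thesis's exact argument): adding the same constant to a motif's weight in every module leaves all probabilities unchanged, so neither the sign of an individual u_mi nor the value zero is identifiable on its own. The weights below are hypothetical.

```python
import math


def softmax_rows(u, r):
    """Module probabilities under the assumed softmax of inner products."""
    acts = [sum(a * b for a, b in zip(row, r)) for row in u]
    top = max(acts)
    exps = [math.exp(a - top) for a in acts]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical weights for 3 modules x 2 motifs, including a zero and a negative.
u = [[0.0, 1.0],
     [2.0, 0.5],
     [-1.0, 0.0]]

# Add the same constant to motif 0 in every module: all its weights become positive.
shift = 5.0
u_shifted = [[row[0] + shift, row[1]] for row in u]

r = [1, 1]
before = softmax_rows(u, r)
after = softmax_rows(u_shifted, r)  # identical distribution over modules
```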

43 Sparsity TRUE PROCESS INFERRED PROCESS

44 Sparsity Sparsity can be understood as pruning. Pruning can improve generalisation performance (it deals with overfitting both by damping and by decreasing the degrees of freedom). Pruning ought not to be seen as a combinatorial problem; it can be dealt with using appropriate prior distributions. Reconceptualise the problem:

45 Sparsity: the Laplacian How to prune using a prior: choose a prior with a simple discontinuity at the origin, so that the penalty term does not vanish near the origin; every time a parameter crosses the origin, establish whether it will escape the origin or is trapped in Brownian motion around it; if trapped, force both its gradient and value to 0 and freeze it. Can actively look for nearby zeros to accelerate the pruning rate.
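
The pruning procedure above can be sketched as gradient ascent on a log-likelihood with a Laplacian (L1) penalty. The trapping test used here, that the likelihood gradient at zero is no larger in magnitude than the penalty strength, is one plausible criterion and an assumption rather than the thesis's stated rule; the quadratic toy likelihood and all names are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def prune_with_laplacian(w, loglik_grad, lam, lr=0.1, steps=500):
    """Gradient ascent on loglik(w) - lam * sum(|w_i|), freezing trapped weights at 0."""
    w = list(w)
    frozen = [False] * len(w)
    for _ in range(steps):
        g = loglik_grad(w)
        new_w = list(w)
        for i in range(len(w)):
            if frozen[i]:
                continue  # pruned: value and gradient are held at zero
            proposal = w[i] + lr * (g[i] - lam * sign(w[i]))
            if w[i] != 0 and sign(proposal) != sign(w[i]):
                # The parameter crossed the origin: look at the likelihood
                # gradient with this coordinate set exactly to zero.
                at_zero = w[:i] + [0.0] + w[i + 1:]
                if abs(loglik_grad(at_zero)[i]) <= lam:
                    # The penalty wins on both sides of zero: trapped, so prune.
                    new_w[i], frozen[i] = 0.0, True
                    continue
            new_w[i] = proposal  # otherwise the parameter escapes the origin
        w = new_w
    return w, frozen


# Toy log-likelihood -0.5 * sum((w_i - t_i)**2) with hypothetical targets t.
targets = [2.0, 0.3, -1.5]
grad = lambda w: [t - wi for t, wi in zip(targets, w)]
w_final, frozen = prune_with_laplacian([1.0, 1.0, 1.0], grad, lam=0.5)
```

In this toy run the weakly supported middle weight is pruned to exactly zero, the third weight crosses the origin and escapes to the negative side, and the first is only shrunk by the penalty.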

46 Results: generalisation performance Synthetic Dataset with 49 motifs, 20 modules and 1800 datapoints

47 Results: interpretability TRUE MODULE STRUCTURE DEFAULT MODEL: LEARNT WEIGHTS LAPLACIAN PRIOR MODEL: LEARNT WEIGHTS

48 Regrets: BIOLOGICAL DATA

49 Aims for this presentation: 1. Briefly present the Segal model and the main criticisms offered in the thesis 2. Briefly introduce PRMs 3. Outline directions for future work

50 Probabilistic Relational Models How to model context-specific regulation? Need to cluster the experiments...

51 Probabilistic Relational Models Variable A can vary with genes but not with experiments

52 Probabilistic Relational Models We now have variability with experiments but also with genes!

53 Probabilistic Relational Models Variability with experiments as required but too many dependencies

54 Probabilistic Relational Models Variability with experiments as required provided we constrain the parameters of the probability distributions P(E|A) to be equal

55 Probabilistic Relational Models Resulting BN is essentially UNIQUE. But derivation: VAGUE, COMPLICATED, UNSYSTEMATIC

56 Probabilistic Relational Models GENES g.S_1, g.S_2, ... g.R_1, g.R_2, ... g.M g.E_1, g.E_2, ... this variable cannot be considered an attribute of a gene, because it has attributes of its own that are gene-independent

57 Probabilistic Relational Models GENES g.S_1, g.S_2, ... g.R_1, g.R_2, ... g.M g.E_1, g.E_2, ...

58 Probabilistic Relational Models GENES g.S_1, g.S_2, ... g.R_1, g.R_2, ... g.M g.E_1, g.E_2, ... EXPERIMENTS e.Cycle_Phase e.Dye_Type

59 Probabilistic Relational Models GENES g.S_1, g.S_2, ... g.R_1, g.R_2, ... g.M g.E_1, g.E_2, ... EXPERIMENTS e.Cycle_Phase e.Dye_Type An expression measurement is an attribute of both a gene and an experiment.

60 Probabilistic Relational Models GENES g.S_1, g.S_2, ... g.R_1, g.R_2, ... g.M g.E_1, g.E_2, ... EXPERIMENTS e.Cycle_Phase e.Dye_Type MEASUREMENTS m(e,g).Level

61 Examples of PRMs - 1 Segal et al, “From Promoter Sequence to Gene Expression”

63 Examples of PRMs - 2 Segal et al, “Decomposing gene expression into cellular processes”

65 Probabilistic Relational Models PRM = {BN_1, BN_2, BN_3, ...}. Given Dataset 1, PRM = BN_1; given Dataset 2, PRM = BN_2. Relational schema: higher-level description of data. PRM: higher-level description of BNs.
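
The idea that one PRM yields a different Bayesian network for each dataset can be sketched as a template that is unrolled over the objects present. The two template dependencies below (g.M and e.Cycle_Phase as parents of m(e,g).Level) are a hypothetical simplification of the schema on the earlier slides, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str


@dataclass(frozen=True)
class Experiment:
    name: str


def ground_network(genes, experiments):
    """Unroll a toy PRM template into the edge list of a ground Bayesian network.

    Assumed template dependencies:
      g.M -> m(e,g).Level   and   e.Cycle_Phase -> m(e,g).Level
    """
    edges = []
    for g in genes:
        for e in experiments:
            level = f"m({e.name},{g.name}).Level"
            edges.append((f"{g.name}.M", level))
            edges.append((f"{e.name}.Cycle_Phase", level))
    return edges


# Same template, two datasets, two different ground networks.
bn1 = ground_network([Gene("g1"), Gene("g2")], [Experiment("e1")])
bn2 = ground_network([Gene("g1")],
                     [Experiment("e1"), Experiment("e2"), Experiment("e3")])
```

Because every m(e,g).Level node is generated from the same template, its conditional distribution would share one set of parameters across the ground network, which is the parameter tying the earlier slides arrive at by hand.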

66 Probabilistic Relational Models Relational vs flat data structures: natural generalisation – knowledge carries over; expandability; richer semantics – better interpretability; no loss in coherence. Personal opinion (not tested yet): not entirely natural as a generalisation; some loss in interpretability; some loss in coherence.

67 Aims for this presentation: 1. Briefly present the Segal model and the main criticisms offered in the thesis 2. Briefly introduce PRMs 3. Outline directions for future work

68 Future research 1. Improve the learning algorithm: ‘soften’ it by exploiting sparsity; systematise dynamic addition / deletion

69 Future research 2. Model Selection Techniques: improve interpretability; learn the optimal number of modules in our model

70 Future research 2. Model Selection Techniques: improve interpretability; learn the optimal number of modules in our model. Are such methods consistent? Do they carry over just as well in PRMs?

71 Future research 3. Fine tune the Laplacian regulariser to fit the skewing of the model

72 Future research 4. The choice of encoding the question into a BN/PRM is only partly determined by the domain. Are there any general ‘rules’ about how to restrict the choice so as to promote interpretability?

73 Future research 5. Explore methods to express structural, nonquantifiable prior beliefs about the biological domain using Bayesian tools.

74 Summary: 1. Briefly presented the Segal model and the main observations offered in the thesis 2. Briefly introduced PRMs 3. Hinted towards directions for future work

