Constrained graph structure learning by integrating diverse data types


1 Constrained graph structure learning by integrating diverse data types
Sushmita Roy
Computational Network Biology
Biostatistics & Medical Informatics 826 / Computer Sciences 838
Sep 27th 2016

2 Goals for this lecture
Different types of integrative inference frameworks
Supervised: a Naïve Bayes classification approach
Unsupervised: Physical Module Networks (PMNs)
N. Novershtern, A. Regev, and N. Friedman, "Physical module networks: an integrative approach for reconstructing transcription regulation," Bioinformatics, vol. 27, no. 13, pp. i177-i185, Jul. 2011
Application of PMNs to real data

3 Why constrained structure learning?
Learning genome-scale networks is computationally challenging:
The space of possible graphs is huge
There are not enough training examples to learn these networks reliably
Multiple equivalent models can be learned
One type of data (expression) might not inform us of all the regulatory edges

4 RECAP: Different types of networks
Physical networks:
Transcriptional regulatory networks: interactions between regulatory proteins (transcription factors) and genes
Protein-protein: interactions among proteins
Signaling networks: protein-protein and protein-small molecule interactions that relay signals from outside the cell to the nucleus
Functional networks:
Metabolic: reactions through which enzymes convert substrates to products
Genetic: interactions among genes which, when perturbed together, produce a more significant phenotype than when perturbed individually

5 Types of integrative inference frameworks
Supervised learning:
Requires examples of interactions and non-interactions
Trains a classifier based on edge-specific features
Unsupervised learning:
Edge aggregation
Model-based learning, where auxiliary datasets serve to provide priors on the graph structure

6 Supervised learning for integrative network inference

7 A few supervised learning approaches
Functional networks:
MouseNET (Y. Guan, C. L. Myers, et al., "A genomewide functional network for the laboratory mouse," PLoS Comput Biol, vol. 4, no. 9, p. e1000165, Sep. 2008)
STRING (L. J. Jensen, M. Kuhn, et al., "STRING 8: a global view on proteins and their functional interactions in 630 organisms," Nucleic Acids Research, vol. 37, no. Database issue, pp. D412-D416, Jan. 2009)
Regulatory networks:
D. Marbach, S. Roy, F. Ay, et al., "Predictive regulatory models in Drosophila melanogaster by integrative inference of transcriptional networks," Genome Research, vol. 22, no. 7, Jul. 2012
F. Mordelet and J.-P. Vert, "SIRENE: supervised inference of regulatory networks," Bioinformatics, vol. 24, no. 16, pp. i76-i82, Aug. 2008

8 Key points of supervised learning approaches
Ground truth is used for training a classifier or computing benchmarking scores
Different datasets are represented as features of a pair of genes/proteins
Largely applied for functional network inference, and less so for regulatory networks

9 Supervised learning of interactions
Define I12 = 1 if X1 interacts with Y2, and 0 otherwise
X1Y2.features: attributes of the pair (X1, Y2)
Given: probability of interaction, P(I12=1|X1Y2.features)
We need: probability of no interaction, P(I12=0|X1Y2.features)
Predict an edge when Prob. of interacting > Prob. of non-interacting
[Figure: the classifier labels the pair (X1, Y2) "Yes" (I12=1) or "No" (I12=0).]
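The decision rule on this slide can be sketched in a few lines of Python. The logistic model, the feature meanings, and all weights below are illustrative placeholders, not the parameters of any real classifier:

```python
import math

def interaction_prob(features, weights, bias):
    """P(I12 = 1 | features) under a simple logistic model.

    The weights are illustrative, not learned from any real dataset.
    """
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_edge(features, weights, bias):
    """Predict an edge when Prob. of interacting > Prob. of non-interacting."""
    p = interaction_prob(features, weights, bias)
    return p > 1.0 - p  # equivalent to p > 0.5

# Hypothetical feature vector for the pair (X1, Y2): e.g. co-expression
# correlation, shared annotation flag, motif score.
features = [0.8, 1.0, 0.3]
weights = [2.0, 1.5, 0.5]
print(predict_edge(features, weights, bias=-1.5))
```

Since the two probabilities sum to one, comparing them reduces to checking whether P(I12=1|features) exceeds 0.5.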

10 Supervised learning of interactions
[Figure: positive example pairs and negative example pairs undergo feature extraction to produce a feature set; a classifier is trained on these features (training) and then applied to unlabeled test pairs (testing) to output predicted edges.]

11 MouseNET: Inferring functional networks by supervised integration of diverse datasets
Goal: predict functional interactions between pairs of genes based on diverse data sets
Gold standard:
Positive set: hand-curated pairs of proteins that are known to be involved in the same function
Negative set: pairs of proteins that have functional annotations but do not share any
Diverse datasets representing noisy observations of edges:
Physical interaction databases
Co-association with different diseases
Interactions transferred from orthologous pairs of yeast proteins
Co-expression and co-tissue localization
Based on a probabilistic framework for data integration; the classification algorithm is a Naïve Bayes Classifier
Y. Guan, C. L. Myers, et al., "A genomewide functional network for the laboratory mouse," PLoS Comput Biol, vol. 4, no. 9, p. e1000165, Sep. 2008

12 Naïve Bayes classifier to integrate different datasets
Let FR denote the random variable for an interaction (functional relationship)
Let E1..En represent the evidence from the different databases for the edge
Probabilistic data integration treats each piece of evidence as a noisy observation of the edge
[Figure: a naïve Bayes network with class node FR and evidence nodes E1, E2, ..., En as its children]
Learning entails estimating the conditional distributions P(Ei | FR)
Naïve Bayes assumption: the evidences are independent given the class variable
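Under the naïve Bayes assumption, the posterior over FR factorizes into a product of per-evidence likelihood ratios. A minimal sketch with binary evidence; all conditional probabilities below are made-up placeholders, not MouseNET's actual estimates:

```python
import math

def naive_bayes_posterior(evidence, likelihoods, prior):
    """P(FR = 1 | E1..En) under the naive Bayes independence assumption.

    evidence: observed 0/1 values for E1..En.
    likelihoods: per-evidence dicts p[c][e] = P(Ei = e | FR = c).
    prior: P(FR = 1). All numbers used here are illustrative placeholders.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for e, p in zip(evidence, likelihoods):
        # Each evidence source contributes an independent log-likelihood ratio.
        log_odds += math.log(p[1][e] / p[0][e])
    return 1.0 / (1.0 + math.exp(-log_odds))

# Three hypothetical evidence sources, e.g. a physical-interaction database,
# co-expression, and a transferred orthologous yeast interaction.
likelihoods = [
    {1: {1: 0.7, 0: 0.3}, 0: {1: 0.1, 0: 0.9}},
    {1: {1: 0.6, 0: 0.4}, 0: {1: 0.3, 0: 0.7}},
    {1: {1: 0.5, 0: 0.5}, 0: {1: 0.05, 0: 0.95}},
]
post = naive_bayes_posterior([1, 1, 1], likelihoods, prior=0.01)
print(round(post, 3))
```

Even with a small prior, three agreeing evidence sources push the posterior above 0.5, which is the point of integrating diverse datasets.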

13 MouseNet: A Naïve Bayes classification approach to infer a functional network
Different types of datasets that contribute to P(Ei|FR)

14 MouseNET recovers functional relationships between mouse proteins
Data integration helps!

15 Classes of methods for integrative unsupervised network inference
Two approaches:
Weighted edge aggregation: weighted aggregation of different networks (D. Marbach, S. Roy, F. Ay, et al., "Predictive regulatory models in Drosophila melanogaster by integrative inference of transcriptional networks," Genome Research, vol. 22, no. 7, Jul. 2012)
Constrained model-based learning: auxiliary datasets serve to impose constraints on the graph structure
We will look at three approaches to integrate other types of data to better learn regulatory networks:
Physical Module Networks (Sep 27th)
Bayesian network structure prior distributions (Oct 3rd, 4th, 6th)
Dependency network parameter prior distributions (Oct 6th, 11th)

16 Strengths and weaknesses of different integrative inference paradigms
Supervised:
Evaluation is straightforward; leverages ground truth directly
Easy to integrate different data sources; clear optimization function
Needs ground truth for training, and negative examples are usually not available
Typically does not predict the expression of a target gene
Unsupervised:
Does not need ground truth
Broadly applicable and flexible with data sources
Difficult to evaluate
Typically does not perform as well as supervised learning when ground truth is known
Learning/setting hyper-parameters is challenging

17 Goals for this lecture
Different types of integrative inference frameworks
Supervised
Unsupervised
Physical Module Networks (PMNs)
N. Novershtern, A. Regev, and N. Friedman, "Physical module networks: an integrative approach for reconstructing transcription regulation," Bioinformatics, vol. 27, no. 13, pp. i177-i185, Jul. 2011
Application of PMNs to real data

18 Motivation for Physical Module Networks
Three main approaches to build a transcriptional regulatory network, each with a limitation:
Observational models (e.g. Bayesian networks): fail to distinguish true regulation from co-expression
Perturbational models (e.g. knockouts): fail to distinguish direct from indirect targets
Physical models (TF binding): fail to distinguish functional from non-functional binding
Challenge: build a realistic model of gene regulation by combining changes in gene expression with the underlying physical interactions. Recall the distinction between physical and functional edges.

19 Types of data used in Physical Module Networks
Expression data: genome-wide mRNA levels from multiple microarray or RNA-seq experiments (samples)
Physical interactions:
Transcription factor-gene interactions: ChIP-chip and ChIP-seq, sequence-specific motifs
Protein-protein interactions
There are different types of datasets that can be used to infer the regulatory edges, each imposing different types of constraints (for example, the physical datasets).

20 Approach
Formulate a probabilistic graphical model called a Physical Module Network (PMN)
Two components of the model:
Module network (M)
Physical interaction graph (I)

21 A Physical Module Network
Fig. 1. A PMN physical module. Expression pattern of the Mob1 regulator (top) and its 37 target genes (bottom) during 2 cell cycles (two replicates shown). A physical pathway connects Mob1 via Cdc28 to the transcription factor FKH2, which binds 15 of the module genes. (The figure marks the module and its regulation program.)

22 Module Networks RECAP (Segal et al., 2005)
Key assumptions:
Genes are co-expressed in modules
Genes in the same module have the same regulators
Expression of a gene is predictable from the expression of its regulators (an assumption made by all expression-based network inference methods)
Module networks are made up of:
Module assignments
A graph structure specifying the parents (regulators) of each module
Conditional probability distributions

23 Physical Interaction Graph I
A graph between genes and proteins
Three types of edges:
Protein-protein interactions
Protein-DNA interactions (TF binding)
Transcriptional edges connecting each gene to its protein product
The graph may have nodes that are not measured by expression

24 Consistency between M and I
•  An MN is consistent with an interaction graph if, for each pair of regulator Xi and target module Mj, there is a consistent physical Regulation Path from Xi to Mj. A Regulation Path explains how the "state" of the regulator reaches a particular target module through a set of physical interactions.
•  Formally, a Regulation Path is
–  a sequence of nodes ⟨v1, ..., vn⟩ in I, where v1 is the protein node of the protein product of Xi and vn is a transcription factor (TF) that binds all the genes in Mj;
–  partially directed, such that each edge between vl and vl+1 is either undirected (protein-protein) or directed (protein-DNA).
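The consistency condition can be sketched as a graph search: from the regulator's protein node, look for any reachable TF that binds every gene in the module. This simplified version treats all physical edges as traversable in either direction, ignoring the partial-direction constraint; all node and gene names below are illustrative:

```python
from collections import deque

def has_regulation_path(graph, tf_targets, regulator, module_genes):
    """Check whether the regulator can reach, via physical interactions,
    some TF that binds every gene in the module.

    graph: adjacency dict over protein nodes of the interaction graph I
    (edges traversed in either direction here, a simplification).
    tf_targets: dict TF -> set of genes it binds.
    """
    module = set(module_genes)
    seen, queue = {regulator}, deque([regulator])
    while queue:
        node = queue.popleft()
        # A consistent path ends at a TF binding all module genes
        # (the regulator itself may be that TF).
        if module <= tf_targets.get(node, set()):
            return True
        for nbr in graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return False

# Toy graph mirroring Fig. 1: Mob1 -> Cdc28 -> Fkh2, where Fkh2 binds
# the module genes (gene names illustrative).
graph = {"Mob1": ["Cdc28"], "Cdc28": ["Mob1", "Fkh2"], "Fkh2": ["Cdc28"]}
tf_targets = {"Fkh2": {"g1", "g2", "g3"}}
print(has_regulation_path(graph, tf_targets, "Mob1", ["g1", "g2"]))
```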

25 Example of consistency
A regulation path is needed for consistency between a Module Network and a Physical interaction graph.

26 Learning a PMN
Input: gene expression, a set of potential regulators, and physical protein-protein and protein-DNA interactions.
PMN Learner: an iterative optimization procedure that finds modules, their regulators, and the physical pathways that explain the regulation.
Output: the best configuration found by the learner.

27 Learning in PMN Similar to Module Networks
Optimize the regulatory program per module
Update module assignments
But we need to update I as well:
Check that I and M are consistent
Change I to make sure it is consistent with M
Assess the score change due to the change in I

28 PMN Learning algorithm
Given:
Input gene expression data DX
Observations of physical interactions DI
A pool of potential regulators
Find the PMN that best describes our data
Use a score-based learning framework similar to MNs
Iterative algorithm:
Optimize the regulatory program per module (improve the quality of gene expression prediction); this results in modifications to the physical interaction graph
Update module assignments
Check for consistency between the physical network and the MN
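Stripped of the domain details, the outer loop above is a greedy score-based search: propose moves, keep the best one that improves the (consistency-checked) score, and stop at a local optimum. A toy sketch of just that control flow, with integers standing in for PMN configurations:

```python
def greedy_search(initial, score, neighbors, max_iters=100):
    """Skeleton of a score-based learner's outer loop: repeatedly take the
    best-scoring move until no move improves the score.

    In a PMN the states would be (module assignment, regulatory program,
    interaction graph) triples and neighbors() would only yield moves that
    keep M and I consistent; here both are caller-supplied toys.
    """
    state = initial
    for _ in range(max_iters):
        best = max(neighbors(state), key=score, default=state)
        if score(best) <= score(state):
            break  # local optimum: no move improves the score
        state = best
    return state

# Toy example: states are integers, the score peaks at 7, moves are +/- 1.
result = greedy_search(0, lambda s: -(s - 7) ** 2, lambda s: [s - 1, s + 1])
print(result)
```

The same skeleton underlies module network learning; the PMN-specific part is restricting the neighbor moves to those for which the interaction graph can be kept consistent.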

29 Scoring a PMN
Score of a PMN P = ⟨M, I⟩:
Score(P : D) = Score_M(M : DX) + Score_I(I : DI)
The score decomposes into the Module Network (M) part (from Segal et al., 2005) and the interaction graph (I) part (introduced in this paper), provided the two are consistent with each other.

30 A little bit of notation
Ie: an indicator variable set to 1 if edge e appears in the interaction graph I
de: an indicator variable set to 1 if edge e is observed in the interaction data DI

31 Scoring the interaction graph
Score_I(I : DI) = log P(DI | I) + log P(I), minus the score of the empty graph, a constant that we will ignore.
Assuming edges are independent, the data and prior terms can be rewritten as sums over individual edges:
Score_I(I : DI) = Σe log P(de | Ie) + Σe log P(Ie)
where the P(Ie) are the prior probabilities.

32 Defining the edge probabilities
P(Ie = 1): prior probability that edge e is present in I
P(de | Ie): probability of observing edge e in DI
Adding an edge is more costly than not adding one: P(Ie = 1) < P(Ie = 0)
pe: the p-value associated with e; P(de=1|Ie=1) will be high when pe is small
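Putting the last two slides together, the edge-wise score can be sketched as follows. The specific likelihood choices (1 - pe for a correctly observed true edge, a fixed false-positive rate for observed spurious edges) are assumptions for illustration, not the paper's exact parameterization:

```python
import math

def interaction_graph_score(edges, prior_present=0.1):
    """Sum over edges of log P(de | Ie) + log P(Ie), dropping the
    constant empty-graph term as on the slide.

    edges: list of (in_graph, observed, pvalue) triples.
    prior_present < 0.5 makes adding an edge more costly than not.
    """
    score = 0.0
    for in_graph, observed, pvalue in edges:
        prior = prior_present if in_graph else 1.0 - prior_present
        if in_graph:
            # Assumed likelihood: a true edge is observed with prob 1 - pe,
            # so P(de=1 | Ie=1) is high when the p-value pe is small.
            lik = (1.0 - pvalue) if observed else pvalue
        else:
            # Assumed small false-positive rate for edges not in the graph.
            lik = 0.05 if observed else 0.95
        score += math.log(lik) + math.log(prior)
    return score

# An edge supported by a confident observation (small p-value) scores
# better when included than when left out, despite the prior penalty:
with_edge = interaction_graph_score([(True, True, 0.001)])
without_edge = interaction_graph_score([(False, True, 0.001)])
print(with_edge > without_edge)
```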

33 Consistency check and updating the interaction graph
Module network learning entails:
Adding an edge from regulator R to module Mj
Removing an edge from regulator R to module Mj
Reassigning genes to a module
Each time, we need to change I to make sure it remains consistent, and evaluate the effect on the score

34 Updating interaction graph I when adding an edge
When adding an edge from regulator R to module Mj, check whether there is a consistent regulation path from R to Mj
If there is none, consider adding TF-DNA edges to introduce such a path:
For each TF T, search for the heaviest ("shortest") path from regulator R to T
Add the cost of the edges from T to the genes in the module
Select the T that maximizes this score (path weight plus the sum over the new TF-DNA edges)
If adding TF-DNA edges does not improve the score, do not add R to Mj
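The heaviest-path search on this slide can be sketched with Dijkstra's algorithm: taking edge weights as -log probabilities turns the most probable ("heaviest") path into a shortest path. The graph, TF names, and edge probabilities below are all illustrative:

```python
import heapq
import math

def best_tf_for_module(graph, tf_edge_cost, regulator, tfs):
    """Pick the TF T maximizing (path score from R to T) plus the cost of
    T's new TF-DNA edges to the module genes.

    graph: dict node -> {neighbor: log edge probability}.
    tf_edge_cost: dict TF -> summed log-probability of its TF-DNA edges.
    """
    # Dijkstra over -log probabilities: the most probable path has the
    # smallest summed -log weight.
    dist = {regulator: 0.0}
    heap = [(0.0, regulator)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue
        for nbr, logp in graph.get(node, {}).items():
            nd = d - logp  # -log prob is a nonnegative cost
            if nd < dist.get(nbr, math.inf):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    # Convert back to log-probability scores and add the TF-DNA edge cost.
    candidates = [(-dist[t] + tf_edge_cost[t], t) for t in tfs if t in dist]
    return max(candidates)[1] if candidates else None

# Toy physical graph: R reaches T1 via A (high-probability path) and T2
# via B; T2's TF-DNA edges are much less likely.
graph = {"R": {"A": math.log(0.9), "B": math.log(0.5)},
         "A": {"T1": math.log(0.8)},
         "B": {"T2": math.log(0.9)}}
tf_edge_cost = {"T1": math.log(0.7), "T2": math.log(0.2)}
print(best_tf_for_module(graph, tf_edge_cost, "R", ["T1", "T2"]))
```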

35 Updating interaction graph I when adding an edge
[Figure: the module network proposes adding regulator R to module Mj (genes g1, g2). In the current physical interaction graph I, two kinds of changes are considered: R may itself be the TF T binding g1 and g2, or new edges are added along the shortest path from R to some TF T, together with new TF-DNA interactions from T to g1 and g2. Potential changes to the physical interaction graph must account for the cost of the shortest path and the new TF-DNA interactions.]

36 Updating interaction graph I when removing an edge
When removing an edge from regulator R to module Mj, remove edges in I while maintaining consistency
Examine all edges from R to Mj and remove those whose removal does not violate consistency

37 Updating interaction graph I during module re-assignment
For a gene g being considered for a move from module Mj to Mk, this would entail:
Removing the TF-DNA edges for g in Mj
Adding TF-DNA edges for g in Mk

38 Summary of PMN learning
Uses the same structure as MN learning
Each move in the MN requires some additional book-keeping to ensure a consistent physical interaction graph

39 Experiments on simulated data
•  312 genes
–  7 modules regulated by 10 genes
•  Sample physical networks to exhibit similar properties as experimentally determined physical networks
–  That is, node degree and edge density are similar to those measured experimentally
–  Select 7 TF proteins as the true TFs associated with modules
•  Learn using 200 gene expression samples
•  Evaluate using
–  Likelihood on test data
–  Accurate connection of regulators to genes
–  Accurate inference of the regulation path (only for PMN)

40 On simulated data PMNs have higher likelihood, and higher precision
Fig. 3. Performance on synthetic data. (a) Log likelihood of test samples achieved by PMN (solid line) and MN (dashed line) as a function of module number; plots show averages over 10-fold cross-validation experiments, and error bars show 2 STD. (b) Precision of reconstructing regulator-target pairs by PMN and MN as a function of the number of modules; plots as in (a). (c) Precision of reconstructing regulator-target pairs by PMN and MN as a function of the smoothness (noise) of the expression data; plots as in (a). (d) Reconstructed pathways (PMN only) as a function of noise in the protein-DNA data; plots show average precision and recall over 10-fold cross-validation.

41 Goals for this lecture
Different types of integrative inference frameworks
Supervised
Unsupervised
Physical Module Networks (PMNs)
N. Novershtern, A. Regev, and N. Friedman, "Physical module networks: an integrative approach for reconstructing transcription regulation," Bioinformatics, vol. 27, no. 13, pp. i177-i185, Jul. 2011
Application of PMNs to real data

42 Evaluation on real data: yeast
•  Two expression datasets
–  Assess the ability to recover "known" regulator-module relationships based on regulator perturbation followed by mRNA measurements
–  Yeast cell cycle
•  Physical interactions for 5,640 genes
–  Protein-protein interactions: ~18K
–  Protein-DNA interactions: ~91K

43 G1 and S phase induced module
Dataset description: 50 time points, 594 cycling genes, protein-DNA interactions specifically from the cell cycle.
Major results: PMN learned 36 modules; 11 had one regulator and 4 had two regulators. Regulation path length was ~2.5. Modules differ in the phase of the cell cycle at which they peak; the module above is one such module. Associated TFs are chosen as regulators in only a few modules.
Fig. 5. Yeast cell cycle map. Pathways reconstructed by PMN from the yeast cell cycle data. Left: an example module (#36), induced in G1 & S, is enriched for DNA replication and telomerase maintenance genes, and is regulated by ACM1 and SWI4. Right: all other reconstructed pathways. Modules (rectangles) are colored according to their peak activity phase, and proteins are colored according to the phase in which they are known to play a role.

44 PMN analysis of human host response to influenza infection
Goal: use PMNs to see how viral proteins affect downstream gene expression.
Dataset description: 10 time points; protein-protein interactions between the human host and 10 viral proteins; human protein interactions from various sources, but including only 32 human TFs. Used 12 predefined modules and connected the 10 viral proteins to modules.
Novel insight: new mechanistic pathways from viral proteins to known major immune response regulators (NFKB1, E2F1, IRF1).
Novel insight: viral polymerase proteins act upon host signaling pathways through several apoptosis pathway proteins (TRAF1, API1, etc.).
Fig. 6. Viral infection in human host. Pathways reconstructed from H1N1 influenza virus proteins to responsive gene clusters. (a) Pathway connecting the NP and PB2 viral proteins to cluster 2. (b) Pathway connecting the NA viral protein to clusters 4 and 5, and the HA viral protein to clusters 4 and 7. Color indicates protein categories (see legend).

45 PMN key points
A per-module probabilistic graphical model-based approach
The regulatory program is learned while checking for support in the physical network
Learns a mechanistic program (we will see other ways to do this in later lectures)
Checking for consistency in the physical network adds computational complexity
Dependent upon the accuracy and completeness of the physical network

46 PMNs vs MNs
What are the advantages of physical module networks compared to module networks?
They enable a regulator to be selected based on both expression and a physical path
They provide a more detailed picture of the regulatory network
What are the challenges in using PMNs?
They need less noisy physical interaction graphs
Application to mammalian systems required additional pre-processing
Likely not as scalable as module networks

