Using Bayesian Networks to Analyze Expression Data – Shelly Bar-Nahor
Today 1. Introduction to Bayesian Networks. 2. Describe a method for recovering gene interactions from microarray data, using tools for learning Bayesian Networks. 3. Apply the method to the S. cerevisiae cell-cycle measurements.
Bayesian Networks – a compact representation of probability distributions via conditional independence. The representation consists of two components: 1. G – a directed acyclic graph (DAG); nodes represent random variables (X1, …, Xn) and edges represent direct influence. 2. Θ – a set of conditional probability distributions. Together, these two components specify a unique distribution over (X1, …, Xn). Example (graph A → C, with B): P(C|A) is given by a table, e.g. P(c0|a0) = 0.95, P(c1|a0) = 0.05, P(c0|a1) = 0.1, P(c1|a1) = 0.9.
The graph G encodes the Markov assumption: each variable Xi is independent of its non-descendants given its parents in G. Applying the chain rule of probabilities, P(X1, …, Xn) = ∏i=1..n P(Xi | PaG(Xi)), where P(Xi | PaG(Xi)) is the conditional distribution of each variable Xi given its parents in G. We denote the parameters that specify these distributions by Θ.
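As a minimal sketch (hypothetical probabilities, plain Python, not taken from the paper), the chain-rule factorization for the small example above can be evaluated directly:

```python
# Minimal sketch: a joint probability as a product of local CPDs.
# Toy network A -> C, with B having no parents; probabilities are illustrative only.
P_A = {0: 0.6, 1: 0.4}
P_B = {0: 0.7, 1: 0.3}
P_C_given_A = {0: {0: 0.95, 1: 0.05},   # P(C | A = a0)
               1: {0: 0.10, 1: 0.90}}   # P(C | A = a1)

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(A=a) * P(B=b) * P(C=c | A=a)."""
    return P_A[a] * P_B[b] * P_C_given_A[a][c]

print(joint(0, 1, 0))  # 0.6 * 0.3 * 0.95 = 0.171
```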
The Learning Problem: let m be the number of samples and n the number of variables. Given a training set D = (x[1], …, x[m]), where each x[i] = (xi1, …, xin), find a network B = <G, Θ> that best matches D. Assume first that we know the graph structure; we now want to estimate the parameters Θ given the data D. To do so, we use the likelihood function: L(Θ:D) = P(D | Θ) = P(X1 = x[1], …, Xm = x[m] | Θ) = ∏i=1..m P(x[i] | Θ).
Learning Parameters – Likelihood Function. [Figure: a dataset D with m rows of observed values E[l], B[l], A[l], C[l], and the network in which A has parents B and E, and C has parent A.] Assuming i.i.d. samples, the likelihood function is L(Θ:D) = ∏m P(E[m], B[m], A[m], C[m] | Θ), where the product runs over the m samples.
Learning Parameters – Likelihood Function. L(Θ:D) = ∏m P(E[m], B[m], A[m], C[m] : Θ) = ∏m P(E[m]:Θ) · P(B[m]:Θ) · P(A[m] | B[m], E[m] : Θ) · P(C[m] | A[m] : Θ) = [∏m P(E[m]:Θ)] · [∏m P(B[m]:Θ)] · [∏m P(A[m] | B[m], E[m] : Θ)] · [∏m P(C[m] | A[m] : Θ)].
Learning Parameters – Likelihood Function, General Bayesian Networks. L(Θ:D) = ∏m P(x[m] | Θ) = ∏i ∏m P(Xi[m] | Pai[m] : Θi) = ∏i Li(Θi : D). Decomposition: the likelihood splits into independent estimation problems, one per variable.
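A short sketch of this decomposition (names like `parents` and `cpd` are assumptions made here for illustration): the log-likelihood is a sum of independent per-family terms, one for each variable and its parents.

```python
import math

def log_likelihood(data, parents, cpd):
    """data: list of dicts {variable: value}; parents: {variable: list of parents};
    cpd(x, x_val, pa_vals): returns P(X = x_val | Pa(X) = pa_vals)."""
    total = 0.0
    for x in parents:                 # one independent estimation problem per family
        for sample in data:
            pa_vals = tuple(sample[p] for p in parents[x])
            total += math.log(cpd(x, sample[x], pa_vals))
    return total
```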
Learning Parameters. Two approaches: MLE – the maximum likelihood estimator; and Bayesian inference – learning using Bayes' rule. In the Bayesian approach we represent uncertainty about the parameters with a Bayesian network: Θ is a parent of X[1], X[2], …, X[m] (a plate model). The values of the X[i] are independent given Θ; for a binary variable, P(X[m] = 1 | Θ) = Θ. Bayesian prediction is then inference in this network.
Bayesian prediction is inference in this network: the observed data are x[1], …, x[m] and the query is X[m+1]. P(x[m+1] | x[1], …, x[m]) = ∫ P(x[m+1] | Θ, x[1], …, x[m]) P(Θ | x[1], …, x[m]) dΘ = ∫ P(x[m+1] | Θ) P(Θ | x[1], …, x[m]) dΘ. By Bayes' rule, the posterior over parameters is P(Θ | x[1], …, x[m]) = P(x[1], …, x[m] | Θ) P(Θ) / P(x[1], …, x[m]) (likelihood × prior / probability of the data).
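For a single binary variable with a Beta prior on Θ (a standard textbook case, used here only as an illustration and not specific to this paper), the predictive integral has a closed form:

```python
# Sketch: Bayesian prediction for a binary X with a Beta(alpha, beta) prior on Theta.
# The integral P(x[m+1]=1 | data) = ∫ Theta * P(Theta | data) dTheta reduces to
# (alpha + number of ones) / (alpha + beta + m).
def predict_next_is_one(samples, alpha=1.0, beta=1.0):
    ones = sum(samples)
    return (alpha + ones) / (alpha + beta + len(samples))

print(predict_next_is_one([1, 1, 0, 1]))  # 0.666... with a uniform Beta(1,1) prior
```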
Equivalence Classes of Bayesian Networks. Problem: the joint probability represented by one graph can equally be represented by another. For example, for the chain A → C → B, P(x) = P(A) P(C|A) P(B|C) = P(C) P(A|C) P(B|C), since P(C|A) P(A) = P(A|C) P(C). Let Ind(G) denote the set of independence statements that hold in all distributions satisfying the Markov assumption on G. G and G' are equivalent if Ind(G) = Ind(G').
Equivalence Classes of Bayesian Networks. Two DAGs are equivalent if and only if they have the same underlying undirected graph and the same v-structures. We represent an equivalence class of network structures by a partially directed acyclic graph (PDAG): a directed edge means that all members of the class contain that arc in the same direction, and an undirected edge means that members of the class disagree on its direction.
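A small sketch of this criterion, assuming DAGs are given as dictionaries mapping each variable to its set of parents (a representation chosen here only for illustration):

```python
# Two DAGs are equivalent iff they share the same skeleton and the same v-structures.
def skeleton(dag):
    return {frozenset((child, p)) for child, ps in dag.items() for p in ps}

def v_structures(dag):
    skel = skeleton(dag)
    vs = set()
    for child, ps in dag.items():
        for p1 in ps:
            for p2 in ps:
                if p1 < p2 and frozenset((p1, p2)) not in skel:
                    vs.add((p1, child, p2))   # p1 -> child <- p2 with p1, p2 not adjacent
    return vs

def equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

g1 = {'A': set(), 'C': {'A'}, 'B': {'C'}}   # A -> C -> B
g2 = {'C': set(), 'A': {'C'}, 'B': {'C'}}   # A <- C -> B
print(equivalent(g1, g2))                   # True: same skeleton, no v-structures
```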
Learning Causal Patterns. We want to model the mechanism that generates the dependencies (e.g. gene transcription). A causal network is a model of such causal processes. Representation: a DAG where each node represents a random variable with a local probability model, and the parents of a variable are its immediate causes.
Learning Causal Patterns. Observation – a passive measurement of our domain (i.e. a sample from X). Intervention – setting the values of some variables by forces outside the causal model. A causal network models not only the distribution of observations, but also the effects of interventions.
Learning Causal Patterns. If X causes Y, then manipulating the value of X affects the value of Y. If Y causes X, then manipulating the value of X will not affect Y. Yet X → Y and Y → X are equivalent Bayesian networks. Causal Markov Assumption: given the values of a variable's immediate causes, it is independent of its earlier causes.
Learning Causal Patterns. Under the causal Markov assumption, a causal network can be interpreted as a Bayesian network. From observations alone we cannot distinguish between causal networks that belong to the same equivalence class, but from a directed edge in the PDAG we can infer a causal direction.
So Far… The likelihood function. Parameter estimation and the decomposition principle. Equivalence classes of Bayesian networks. Learning causal patterns.
Analyzing Expression Data – Outline: 1. Present modeling assumptions. 2. Find high-scoring networks. 3. Characterize features. 4. Estimate statistical confidence in features. 5. Present local probability models.
Modeling assumptions. We consider probability distributions over all possible states of the system; a state is described using random variables. Random variables include: the expression level of each individual gene, experimental conditions, temporal indicators (the time/stage at which the sample was taken), and background variables (e.g. which clinical procedure was used to take the sample).
Analyzing Expression Data – step 2: find high-scoring networks.
The Learning Problem – Scoring. Given a training set D = (x[1], …, x[N]) of independent instances, find an equivalence class of networks B = <G, Θ> that best matches D. We use a Bayesian scoring system: S(G:D) = log P(G|D) = … = log P(D|G) + log P(G) + C, where P(G) is the prior over structures and P(D|G) = ∫ P(D|G, Θ) P(Θ|G) dΘ is the marginal likelihood, obtained by integrating the likelihood against the prior over parameters.
The Learning Problem – Scoring, cont. Properties of the selected priors: structure equivalent and decomposable, so S(G:D) = ∑i ScoreContribution(Xi, PaG(Xi) : D). Learning now amounts to finding a structure G that maximizes the score. This problem is NP-hard, so we resort to heuristic search.
Local Search Strategy. Using decomposition, we can change one arc and evaluate the gain made by this change. [Figure: an initial structure G over A, B, C and neighboring structures G' obtained by adding, deleting, or reversing a single arc.] If an arc into Xi is added or deleted, only score(Xi, Pa(Xi)) needs to be re-evaluated. If an arc between Xi and Xj is reversed, only score(Xi, Pa(Xi)) and score(Xj, Pa(Xj)) need to be re-evaluated.
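A minimal sketch of such a search, restricted to arc additions for brevity (deletions and reversals are analogous); `family_score` stands in for a decomposable score such as the one above and is not implemented here:

```python
def hill_climb_additions(variables, data, family_score):
    """Greedy hill-climbing over single-arc additions using decomposed score deltas."""
    parents = {x: set() for x in variables}
    improved = True
    while improved:
        improved = False
        best = None
        for x in variables:
            base = family_score(x, parents[x], data)
            for y in variables:
                if y == x or y in parents[x]:
                    continue
                # Adding y -> x changes only score(x, Pa(x)).
                delta = family_score(x, parents[x] | {y}, data) - base
                if delta > 0 and (best is None or delta > best[0]):
                    best = (delta, x, y)
        if best is not None:
            _, x, y = best
            parents[x].add(y)   # a real implementation must also check acyclicity
            improved = True
    return parents
```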
Find High-Scoring Networks Problem: Small data sets are not sufficiently informative to determine that a single model is the “right” model (or equivalence class of models). Solution: analyze a set of high-scoring networks. Attempt to characterize features that are common to most of these networks and focus on learning them.
Analyzing Expression Data – step 3: characterize features.
Features. We use two classes of features involving pairs of variables. 1. Markov relations – is Y in the Markov blanket of X? Y is in X's Markov blanket if and only if there is an edge between them, or both are parents of another variable. A Markov relation indicates that the two genes are related in some joint biological interaction or process.
Features. 2. Order relations – is X an ancestor of Y in all the networks of a given equivalence class, i.e. does the PDAG contain a path from X to Y in which all the edges are directed? This is an indication that X might be a cause of Y.
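As an illustrative sketch, both feature types can be read off a single learned DAG (again given as a dictionary of parent sets); testing the feature over a whole equivalence class or PDAG, as in the paper, would require more care:

```python
def markov_relation(dag, x, y):
    """Is Y in X's Markov blanket: an edge between them, or a common child?"""
    edge = y in dag.get(x, set()) or x in dag.get(y, set())
    common_child = any(x in ps and y in ps for ps in dag.values())
    return edge or common_child

def order_relation(dag, x, y, visited=None):
    """Is X an ancestor of Y, i.e. is there a directed path X -> ... -> Y?"""
    visited = visited or set()
    for child, ps in dag.items():
        if x in ps and child not in visited:
            if child == y or order_relation(dag, child, y, visited | {child}):
                return True
    return False
```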
Analyzing Expression Data – step 4: estimate statistical confidence in features.
Estimating Statistical Confidence in Features. We want to estimate to what extent the data support a given feature. We use the bootstrap method: we generate "perturbed" versions of the original data and learn from them. We should be more confident in features that are still induced from the perturbed data.
Estimating Statistical Confidence in Features. We use the bootstrap as follows: for i = 1, …, m (in our experiments m = 200): re-sample with replacement N instances from D and denote the resulting dataset by Di; apply the learning procedure to Di to induce a network structure Gi. For each feature f of interest calculate conf(f) = (1/m) ∑i f(Gi), where f(G) is 1 if f is a feature of G and 0 otherwise.
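A sketch of this procedure; `learn_network` and `feature` are placeholders standing in for the structure-learning algorithm and the feature indicator f:

```python
import random

def bootstrap_confidence(data, learn_network, feature, m=200):
    """conf(f) = (1/m) * sum_i f(G_i), with each G_i learned from a resampled dataset."""
    hits = 0
    for _ in range(m):
        resampled = [random.choice(data) for _ in range(len(data))]  # N samples with replacement
        hits += feature(learn_network(resampled))
    return hits / m
```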
Estimating Statistical Confidence in Features. Features induced with high confidence are rarely false positives. The bootstrap procedure is especially robust for the Markov and order features. Conclusions based on high-confidence features are reliable even when the data set is small relative to the model being induced.
Analyzing Expression Data – step 5: present local probability models.
Local Probability Models. To specify a Bayesian network model we need to choose the type of local probability model we use. The choice of representation depends on the type of variables: discrete variables can be represented with a table; for continuous variables there is no representation that can capture all possible densities.
Local Probability Models. We consider two approaches: 1. Multinomial model – treat each variable as discrete and learn a multinomial distribution that describes the probability of each possible state of a child variable given the state of its parents. We discretize by setting a threshold on the ratio between measured expression and control: values lower than 2^(-1/2) are under-expressed (-1), and values higher than 2^(1/2) are over-expressed (+1).
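A sketch of this thresholding (treating intermediate ratios as a third, "unchanged" state 0 is an assumption made here for completeness):

```python
import math

def discretize(ratio, threshold=math.sqrt(2)):          # threshold = 2**0.5
    """Map an expression/control ratio to -1 (under), +1 (over), or 0 (otherwise)."""
    if ratio < 1.0 / threshold:                          # below 2**-0.5
        return -1
    if ratio > threshold:                                # above 2**0.5
        return 1
    return 0
```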
Local Probability Models. 2. Linear-Gaussian model – learn a linear regression model for a child variable given its parents. If U1, …, Uk are the parents of variable X, then P(X | u1, …, uk) ~ N(a0 + ∑i ai·ui, σ²). That is, X is normally distributed around a mean that depends linearly on the values of its parents.
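A sketch of fitting such a local model by least squares (NumPy, with illustrative data):

```python
import numpy as np

def fit_linear_gaussian(U, x):
    """U: (m, k) matrix of parent values; x: (m,) child values.
    Returns (a0, a1, ..., ak) and sigma^2 for X ~ N(a0 + sum_i a_i*u_i, sigma^2)."""
    design = np.column_stack([np.ones(len(x)), U])       # prepend intercept column
    coeffs, *_ = np.linalg.lstsq(design, x, rcond=None)
    sigma2 = np.var(x - design @ coeffs)
    return coeffs, sigma2

U = np.random.randn(100, 2)                              # two hypothetical parents
x = 0.5 + 1.2 * U[:, 0] - 0.7 * U[:, 1] + 0.1 * np.random.randn(100)
print(fit_linear_gaussian(U, x))
```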
Local Probability Models. In the multinomial model, discretizing the measured expression levels loses information. The linear-Gaussian model can only detect dependencies that are close to linear; in particular, it is unlikely to discover combinatorial effects (e.g. a gene is over-expressed only if several other genes are jointly over-expressed).
Application to Cell-Cycle Expression Patterns. Data from Spellman et al., 'Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization', Molecular Biology of the Cell. The data contain 76 gene expression measurements of the mRNA levels of yeast. Spellman et al. identified 800 genes whose expression varied over the different cell-cycle stages.
Application to Cell-Cycle Expression Patterns. We treat each measurement as an independent sample and do not take into account the temporal aspect of the measurements. We compensate by adding, as a root of all learned networks, a variable denoting the cell-cycle phase. We performed two experiments: one with the multinomial distribution and one with the linear-Gaussian distribution. http://www.cs.huji.ac.il/~nirf/GeneExpression/top800/
Robustness Analysis – credibility of confidence assessment. [Plot: linear-Gaussian model, order relations – number of features with confidence at or above each confidence threshold (x-axis).]
Robustness Analysis – credibility of confidence assessment. [Plot: multinomial model, order relations – number of features with confidence at or above each confidence threshold (x-axis).]
Robustness Analysis – adding more genes. [Scatter plot, multinomial model: feature confidence with 250 genes (x-axis) vs. confidence with 800 genes (y-axis).]
Robustness Analysis – discretization. The discretization method penalizes genes whose natural range of variation is small, since a fixed threshold is used. We avoid this problem by normalizing the expression of each gene in the data. The top 20 Markov relations highlighted by this method were somewhat different, while the order relations were more robust, possibly because order relations depend on the global network structure rather than on local features.
Robustness Analysis – comparison between the linear-Gaussian and multinomial experiments. [Scatter plot: confidence in the multinomial experiment (x-axis) vs. confidence in the linear-Gaussian experiment (y-axis).]
Biological Analysis – Order Relations. We found the existence of dominant genes: out of all 800 genes, only a few seem to dominate the order relations. Among them are genes that are directly involved in initiation of the cell cycle and its control, components of pre-replication complexes, and genes involved in DNA repair that are associated with transcription initiation.
Biological Analysis – Markov Relations. Among the top-scoring relations, all those involving two known genes make biological sense. Several of the unknown pairs are physically adjacent on the chromosome and are presumably regulated by the same mechanism. Some relations go beyond the limitations of clustering.
Example: CLN2, RNR3, SVS1, SRO4 and RAD51 all appear in the same cluster in Spellman et al. In our network CLN2 is a parent of the other four, while no links were found between them. This fits biological knowledge: CLN2 is a central and early cell-cycle control gene, while there is no clear biological relationship between the others.
Discussion. The approach is capable of handling noise and of estimating the confidence in the different features of the network. It managed to extract many biologically plausible conclusions. It is capable of learning rich structures from the data, such as causal relationships and interactions between genes other than positive correlation.
Discussion – Ideas: learn models over "clustered" genes; recover all relationships in one analysis; improve the confidence estimation; incorporate biological knowledge as prior knowledge in the analysis; learn causal patterns by adding interventional data.
The End