Hierarchical Bayesian Model Specification Model is specified by the Directed Acyclic Network (DAG) and the conditional probability distributions of all.


1 Hierarchical Bayesian Model Specification
The model is specified by the Directed Acyclic Graph (DAG) and the conditional probability distributions of all nodes given the values of their parents.
The topology of the DAG defines the conditional dependencies of all variables through the directed Markov property, which states that, given the values of its parents, a variable in the model is independent of all its non-descendants.
The DAG and the local distributions define the joint probability distribution of the data and all parameters in the model.
In our case this distribution cannot be explicitly characterized, but it is estimated using a Markov Chain Monte Carlo approach (Gibbs sampler).
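
As a sketch of how the DAG and the local distributions define the joint distribution, the directed Markov property implies the factorization below. The notation ($v_1, \ldots, v_m$ for the nodes of the DAG, data and parameters alike, and $\mathrm{pa}(v_j)$ for the parents of node $v_j$) is introduced here for illustration and does not appear on the slide.

```latex
p(v_1, \ldots, v_m) \;=\; \prod_{j=1}^{m} p\bigl(v_j \mid \mathrm{pa}(v_j)\bigr)
```

Each factor is one of the local conditional distributions attached to a node. A Gibbs sampler needs these factors only up to a normalizing constant, which is why the joint distribution can be sampled even when it cannot be written in closed form.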

2 Uses and Misuses of Clustering
Define a statistical model that facilitates clustering of genes based on similarities of their expression profiles.
Define the method-selection criteria that allow for estimating the "correct" number of clusters.
Show that inappropriate "pre-filtering" can fool the statistical model in the same way it fools the casual observer.
Show appropriate ways to use cluster analysis and illustrate the importance of using the "best available treatment".

3 Clustering of gene expression profiles

4 “Patterns” of Expression - Finite Mixture Model
Pattern $i$: $\mu_i = (\mu_{1i}, \mu_{2i}, \ldots, \mu_{11\,i})$
Data: $\mathbf{x}_{ik} \sim \text{iid } N(\mu_i, \Sigma)$, $k = 1, \ldots, n_i$
$n_i$ = number of genes generated by Pattern $i$
$\pi_i = n_i / n$

5 “Patterns” of Expression - Finite Mixture Model
Any gene profile $\mathbf{x} = (x_1, x_2, \ldots, x_{11}) \in$ all data $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$
Finite Mixture Model
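
The mixture density itself did not survive in the transcript. A standard form consistent with the definitions on the previous slide is sketched below; the number of patterns $K$, the mixing proportions $\pi_i$, the pattern means $\mu_i$, the shared covariance $\Sigma$ and the symbol $f_N$ for the multivariate normal density are assumptions made for this sketch, not a reproduction of the slide.

```latex
f(\mathbf{x}) \;=\; \sum_{i=1}^{K} \pi_i \, f_N(\mathbf{x} \mid \mu_i, \Sigma),
\qquad \sum_{i=1}^{K} \pi_i = 1, \quad \pi_i \ge 0
```

Under this form, each gene profile is generated by first picking Pattern $i$ with probability $\pi_i$ and then drawing the profile from the normal distribution centered at $\mu_i$.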

6 One-dimensional mixture
[Figure: Pattern 1, $N(\mu_{11}, \sigma^2)$, and Pattern 2, $N(\mu_{12}, \sigma^2)$]

7 MCLUST
> library(mclust)
> # Simulate 5000 "genes" x 15 samples of pure noise (no true clusters)
> SimData <- matrix(rnorm(5000*15), ncol=15)
> ColLabels <- c(paste("Tumor_",1:8,sep=""), paste("Control_",1:7,sep=""))
> heatmap(SimData, labCol=ColLabels)
> # Restrict mclust to the EEI model (diagonal covariance, equal across clusters)
> .Mclust$hcModelNames <- c("E","EEI")
> .Mclust$emModelNames <- c("EEI")
> # BIC for 1 to 10 clusters
> BIC.emclust <- EMclust(SimData, 1:10)
> BIC.emclust
BIC:
         EEI
1  -213490.3
2  -213624.9
3  -213753.0
4  -213880.7
5  -213993.7
6  -214121.0
7  -214243.4
8  -214351.6
9  -214481.4
10 -214588.7
> plot(BIC.emclust)
EEI
"1"

8 Determining the number of patterns

9 MCLUST
> # Two-sample t-test (Tumor vs Control) for every row of the noise data
> p.value <- apply(SimData, 1, function(x) t.test(x[1:8], x[9:15], var.equal=TRUE)$p.value)
> # "Weak filter": keep rows with unadjusted p < 0.05
> SigData <- SimData[p.value<0.05,]
> dim(SigData)
[1] 242  15
> heatmap(SigData, labCol=ColLabels)
> BIC.emclust <- EMclust(SigData, 1:10)
> BIC.emclust
BIC:
          EEI
1  -10599.485
2   -9647.645
3   -9685.897
4   -9729.239
5   -9796.119
6   -9849.109
7   -9912.601
8   -9973.645
9  -10037.436
10 -10077.862
> plot(BIC.emclust)
EEI
"1"

10 Determining the number of patterns

11 Summary
The "weak filter" based on selecting "sub-significant" differentially expressed genes created artificial clusters.
When the whole dataset was used, the Bayesian information criterion did the right thing by estimating the correct number of clusters to be one.
Take-home message: when "filtering" before clustering, make sure that appropriate statistical significance levels have been used.
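
A minimal sketch of why the weak filter misleads, re-using the simulation from the MCLUST slides with base R only. The seed and the variable names n.pass, expected and n.pass.bh are assumptions added here, and the Benjamini-Hochberg adjustment is shown only as one possible way to apply a stricter significance level, not as the method used in the presentation.

```r
set.seed(1)  # arbitrary seed, assumed here for reproducibility

# Pure noise: 5000 "genes" x 15 samples, so no gene is truly differentially expressed
SimData <- matrix(rnorm(5000 * 15), ncol = 15)

# Two-sample t-test (first 8 columns vs last 7) for every row
p.value <- apply(SimData, 1, function(x)
  t.test(x[1:8], x[9:15], var.equal = TRUE)$p.value)

# Number of rows passing the unadjusted 0.05 filter
n.pass <- sum(p.value < 0.05)

# Under the null, about 5% of rows are expected to pass by chance alone
expected <- 5000 * 0.05

# Same cut-off after a multiple-testing adjustment (Benjamini-Hochberg)
n.pass.bh <- sum(p.adjust(p.value, method = "BH") < 0.05)
```

With 5000 null rows, roughly 250 are expected to pass an unadjusted 0.05 cut-off, which is in line with the 242 rows retained on the slide. Every one of those rows is a false positive, and the t-test has selected them for looking different between the two groups, which is what produces the artificial two-cluster structure.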

12 Using clustering to find "patterns" among differentially expressed genes
Cluster analysis is preceded by a rigorous statistical analysis.
For example, identify genes that were "differentially" expressed in at least one experimental comparison.
Among all these genes, some will have similar behavior across all experimental conditions.
Clustering is a way of organizing the behavior of differentially expressed genes across different experimental conditions.

13 Using clustering to find "patterns" among differentially expressed genes

14 Using clustering to find "patterns" among all genes
No filtering is performed.
You can perform the "quality filtering".
The goal is to identify statistically significant patterns.
Using the best available method becomes extremely important.

15 Does it matter which clustering procedure we use?
Simple, commonly used method: Euclidean distance based hierarchical clustering.
"Complicated" method: context-specific infinite mixtures.
Data: 5685 yeast genes across two experiments (cell cycle and sporulation), with no variability-based filter.
135 genes with closest co-expression partners.
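
For reference, the "simple" method can be sketched in base R as follows. The toy matrix, the seed, the average linkage and the cut into two groups are all assumptions made for this illustration; the slide does not say which linkage was used, and the yeast data are not reproduced here.

```r
set.seed(42)  # arbitrary seed for the toy data

# Toy expression matrix (made up): 6 genes x 4 conditions,
# three genes near 0 and three genes near 5
expr <- rbind(matrix(rnorm(12, mean = 0, sd = 0.1), nrow = 3),
              matrix(rnorm(12, mean = 5, sd = 0.1), nrow = 3))
rownames(expr) <- paste0("gene", 1:6)

# Euclidean distances between gene profiles (rows)
d <- dist(expr, method = "euclidean")

# Agglomerative hierarchical clustering; average linkage is an assumption
tree <- hclust(d, method = "average")

# Cut the tree into two groups of genes
groups <- cutree(tree, k = 2)
```

Note that the number of groups passed to cutree has to be chosen by the analyst, and the procedure returns no measure of statistical significance for the resulting clusters.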

16 "Objective" Performance Assessments Using KEGG as the Gold Standard
There is a large imbalance between the total numbers of negative and positive pairs: there are 17 times more negative pairs than positive pairs. As a result, a small FPR can still produce more false positives than true positives.
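
The effect of the 17:1 imbalance can be checked with a short calculation. Only the ratio of 17 comes from the slide; the number of positive pairs and the two rates below are hypothetical values chosen for illustration.

```r
# Hypothetical counts and rates, chosen only to illustrate the 17:1 imbalance
n.pos <- 1000          # positive pairs (assumed)
n.neg <- 17 * n.pos    # 17 times more negative pairs, as stated on the slide
tpr   <- 0.50          # assumed true positive rate
fpr   <- 0.05          # assumed false positive rate

tp <- tpr * n.pos      # true positives:  0.50 * 1000  = 500
fp <- fpr * n.neg      # false positives: 0.05 * 17000 = 850

# Fraction of reported pairs that are real
precision <- tp / (tp + fp)
```

In general, false positives outnumber true positives whenever FPR > TPR / 17, so with these assumed rates a 5% FPR already gives 850 false positives against 500 true positives.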

17 Summary
Using clustering alone, one can identify "significant" patterns of expression when using appropriate methodology.
For example, the yeast data clustered in this example did not have any replicates, so the traditional analysis to identify differentially expressed genes before clustering is not feasible.
The statistical significance of the resulting clusters needs to be carefully examined.

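18 Infinite Bayesian Mixtures
[Figure: DAG of the model, with nodes including X, C, M, r and w]
$M = (\mu_1, \ldots, \mu_K)$
$\Sigma = (\sigma_1, \ldots, \sigma_K)$
$\Pi = (\pi_1, \ldots, \pi_K)$
$C = (c_1, \ldots, c_N)$, $c_i \in \{1, \ldots, K\}$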

19 Conditional posterior distributions and Gibbs Sampler

20 Gibbs Sampler
Result: a sequence of cluster assignments $(c_{k,1}, \ldots, c_{k,n})$, $k = 1, \ldots, k_{\max}$, such that [equation missing].
The posterior distribution is summarized through “posterior pairwise probabilities of co-expression”, $p(c_i = c_j \mid X)$.
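
A minimal sketch of how the pairwise probabilities could be estimated from the sampler output: $p(c_i = c_j \mid X)$ is approximated by the fraction of saved draws in which genes $i$ and $j$ share a cluster label. The function name pairwise_coexpression and the toy draws are hypothetical and are not taken from the presentation.

```r
# Hypothetical helper: estimate p(c_i = c_j | X) from saved Gibbs draws.
# `draws` is a (k_max x n) matrix; row k holds (c_{k,1}, ..., c_{k,n}).
pairwise_coexpression <- function(draws) {
  n <- ncol(draws)
  p <- matrix(0, n, n)
  for (k in seq_len(nrow(draws))) {
    # outer(..., "==") is TRUE where genes i and j share a cluster in draw k
    p <- p + outer(draws[k, ], draws[k, ], "==")
  }
  p / nrow(draws)
}

# Toy draws (made up for illustration): 4 iterations, 3 genes
draws <- rbind(c(1, 1, 2),
               c(1, 1, 1),
               c(2, 2, 1),
               c(1, 2, 2))
pp <- pairwise_coexpression(draws)
```

Because only equality of labels within a draw is used, the estimate does not depend on how clusters happen to be numbered from one iteration to the next. In the toy example, genes 1 and 2 share a cluster in three of the four draws, so pp[1, 2] is 0.75.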

21 Properties
“Pooling” information from the whole dataset by estimating both “patterns” and “assignments”, similar to K-means (K-means is actually equivalent to a special case of the mixture model with a known number of clusters).
Does not require specification of the right number of clusters (unlike K-means).
Gives direct estimates of statistical significance (unlike anything else on the market).
Instead of lamenting which distance measure to use, focus on the appropriate statistical model, which is a well-defined problem.
Works for any type of data.

22 Finding important functional groups for up-regulated genes
Using the "Ease" annotation tool (http://david.niaid.nih.gov/david/), we obtained the following significant gene ontologies: Up_DexANDNE2ANDirr_381_GO.htm
Homework:
1) Download and install Ease.
2) Select the top 20 most significantly up-regulated genes in our W-C dataset and identify significantly over-represented categories (using the three-way ANOVA analysis).
3) Repeat the analysis with 30, 40, 50 and 100 up-regulated and down-regulated genes.
4) Prepare questions for the next class regarding problems you run into.

