Hierarchical Bayesian Model Specification


Hierarchical Bayesian Model Specification
The model is specified by a directed acyclic graph (DAG) and the conditional probability distribution of each node given the values of its parents.
The topology of the DAG defines the conditional dependencies among all variables through the directed Markov property: given the values of its parents, a variable in the model is independent of all its non-descendants.
The DAG and the local distributions together define the joint probability distribution of the data and all parameters in the model.
In our case this distribution cannot be characterized explicitly, but it can be estimated using a Markov chain Monte Carlo approach (the Gibbs sampler).
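
The directed Markov property implies the standard factorization of the joint distribution over a DAG. As a sketch, write $v_1, \ldots, v_m$ for all nodes (data and parameters) and $\mathrm{pa}(v_j)$ for the parents of $v_j$; this notation is assumed here and does not come from the slides.

```latex
p(v_1, \ldots, v_m) = \prod_{j=1}^{m} p\bigl(v_j \mid \mathrm{pa}(v_j)\bigr)
```

Each factor is one of the local conditional distributions, which is why specifying the graph and the local distributions is enough to specify the whole model.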

Uses and Misuses of Clustering
Define a statistical model that facilitates clustering of genes based on similarities of their expression profiles.
Define the method-selection criteria that allow for estimating the "correct" number of clusters.
Show that inappropriate "pre-filtering" can fool the statistical model in the same way it fools the casual observer.
Show appropriate ways to use cluster analysis and illustrate the importance of using the "best available treatment".

Clustering of gene expression profiles

“Patterns” of Expression - Finite Mixture Model
Pattern $i$: $\boldsymbol{\mu}_i = (\mu_{1i}, \mu_{2i}, \ldots, \mu_{11,i})$
Data: $\mathbf{x}_{ik} \sim \text{iid } N(\boldsymbol{\mu}_i, \Sigma)$, $k = 1, \ldots, n_i$
$n_i$ = number of genes generated by Pattern $i$
$\pi_i = n_i / n$

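“Patterns” of Expression - Finite Mixture Model
Any gene profile: $\mathbf{x} = (x_1, x_2, \ldots, x_{11})$
All data: $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$
Finite Mixture Model (the formula was shown as an image on the slide)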
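
The slide's own mixture formula is not preserved in the transcript. The usual form of a finite normal mixture consistent with the definitions on these slides would be the following, where $K$ (the number of patterns) is a symbol assumed here, $\pi_i$ is the proportion of genes generated by Pattern $i$, $\boldsymbol{\mu}_i$ is its mean profile, and $\Sigma$ is the common covariance:

```latex
p(\mathbf{x}) = \sum_{i=1}^{K} \pi_i \, N(\mathbf{x} \mid \boldsymbol{\mu}_i, \Sigma),
\qquad \sum_{i=1}^{K} \pi_i = 1
```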

One-dimensional mixture
Pattern 1: $N(\mu_{11}, \sigma)$
Pattern 2: $N(\mu_{12}, \sigma)$
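
A minimal Python sketch of a one-dimensional, two-pattern mixture may help make the picture concrete. The weights, means and standard deviation below are hypothetical values chosen for illustration, and the function names are not from the slides.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, means, sd):
    """Density of a finite mixture of normals sharing one standard deviation."""
    return sum(w * normal_pdf(x, m, sd) for w, m in zip(weights, means))

def sample_mixture(n, weights, means, sd, rng):
    """Draw n values: pick a pattern by its weight, then draw from its normal."""
    draws = []
    for _ in range(n):
        pattern = rng.choices(range(len(means)), weights=weights)[0]
        draws.append(rng.gauss(means[pattern], sd))
    return draws

# Hypothetical values, chosen only for illustration.
weights = [0.3, 0.7]   # pattern proportions (n_i / n)
means = [-2.0, 2.0]    # centers of Pattern 1 and Pattern 2
sd = 1.0               # common standard deviation

rng = random.Random(0)
values = sample_mixture(1000, weights, means, sd, rng)
density_at_zero = mixture_pdf(0.0, weights, means, sd)
```

A histogram of `values` would show two bumps of unequal height, which is the shape the slide's figure presumably illustrates.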

MCLUST
> library(mclust)
> # Simulate pure noise: 5000 "genes" by 15 samples, no real clusters
> SimData <- matrix(rnorm(5000*15), ncol=15)
> ColLabels <- c(paste("Tumor_",1:8,sep=""), paste("Control_",1:7,sep=""))
> heatmap(SimData, labCol=ColLabels)
> # Restrict the fitted models to "EEI"
> .Mclust$hcModelNames <- c("E","EEI")
> .Mclust$emModelNames <- c("EEI")
> # BIC for 1 to 10 clusters
> BIC.emclust <- EMclust(SimData, 1:10)
> BIC.emclust
BIC: EEI
> plot(BIC.emclust)
EEI "1"
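
The idea behind choosing the number of clusters by BIC can be sketched in Python without mclust. The code below is not the mclust implementation: it is a simplified one-dimensional EM fit with a shared variance, using the "larger is better" form of BIC (2 log L minus the number of parameters times log n). All function names and simulated values are assumptions made for illustration.

```python
import math
import random

def log_likelihood(data, weights, means, variance):
    """Log-likelihood of a 1-D normal mixture with one shared variance."""
    scale = math.sqrt(2.0 * math.pi * variance)
    total = 0.0
    for x in data:
        density = sum(w * math.exp(-0.5 * (x - m) ** 2 / variance) / scale
                      for w, m in zip(weights, means))
        total += math.log(max(density, 1e-300))  # guard against underflow
    return total

def fit_mixture(data, k, iterations=200):
    """EM for k normal components with a shared variance; returns the fitted log-likelihood."""
    n = len(data)
    ordered = sorted(data)
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]  # start at spread-out quantiles
    weights = [1.0 / k] * k
    center = sum(data) / n
    variance = sum((x - center) ** 2 for x in data) / n
    for _ in range(iterations):
        # E-step: posterior probability of each component for each point.
        # The normalizing constant cancels because the variance is shared.
        resp = []
        for x in data:
            parts = [w * math.exp(-0.5 * (x - m) ** 2 / variance)
                     for w, m in zip(weights, means)]
            norm = sum(parts) or 1e-300
            resp.append([p / norm for p in parts])
        # M-step: update weights, means and the shared variance.
        sizes = [max(sum(r[j] for r in resp), 1e-12) for j in range(k)]
        weights = [s / n for s in sizes]
        means = [sum(r[j] * x for r, x in zip(resp, data)) / sizes[j] for j in range(k)]
        variance = sum(r[j] * (x - means[j]) ** 2
                       for r, x in zip(resp, data) for j in range(k)) / n
    return log_likelihood(data, weights, means, variance)

def bic(data, k):
    """BIC in the 'larger is better' form: 2*logL - (number of parameters)*log(n)."""
    parameters = 2 * k  # k means + (k - 1) weights + 1 shared variance
    return 2.0 * fit_mixture(data, k) - parameters * math.log(len(data))

def best_number_of_components(data, candidates=(1, 2, 3)):
    """Candidate number of components with the largest BIC."""
    return max(candidates, key=lambda k: bic(data, k))

rng = random.Random(0)
# Hypothetical data: one homogeneous group, and two well-separated groups.
null_data = [rng.gauss(0.0, 1.0) for _ in range(300)]
two_groups = ([rng.gauss(-3.0, 1.0) for _ in range(150)]
              + [rng.gauss(3.0, 1.0) for _ in range(150)])
k_null = best_number_of_components(null_data)
k_two = best_number_of_components(two_groups)
```

On homogeneous data such as `null_data`, the log(n) penalty would normally outweigh the small likelihood gain from extra components, so one component should be preferred, which mirrors the simulated-noise example on the slide.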

Determining the number of patterns

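MCLUST
> # Two-sample t-test (Tumor vs. Control) for every row of the noise data
> p.value <- apply(SimData, 1, function(x) t.test(x[1:8], x[9:15], var.equal=TRUE)$p.value)
> # Keep only the rows that pass the weak p < 0.05 filter
> SigData <- SimData[p.value < 0.05, ]
> dim(SigData)
[1]
> heatmap(SigData, labCol=ColLabels)
> BIC.emclust <- EMclust(SigData, 1:10)
> BIC.emclust
BIC: EEI
> plot(BIC.emclust)
EEI "1"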

Determining the number of patterns

Summary
The "weak filter", which selected "sub-significant" differentially expressed genes, created artificial clusters.
When the whole dataset was used, the Bayesian information criterion did the right thing by estimating the number of clusters to be one, which is correct.
Take-home message: when "filtering" before clustering, make sure that appropriate statistical significance levels have been used.
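
Why a weak filter manufactures clusters can be seen in a small simulation. The Python sketch below mirrors the simulated-noise setup (5000 genes, 8 tumor and 7 control samples) but, to stay within the standard library, replaces the t-test with a z-test that assumes the unit variance is known; that substitution and all names are assumptions of this sketch.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(1)
n_genes, n_tumor, n_control = 5000, 8, 7

# Pure noise: no gene is truly differentially expressed.
data = [[rng.gauss(0.0, 1.0) for _ in range(n_tumor + n_control)]
        for _ in range(n_genes)]

# Standard error of the difference in group means when the variance is known to be 1.
std_error = (1.0 / n_tumor + 1.0 / n_control) ** 0.5
standard_normal = NormalDist()

def group_difference(profile):
    """Mean of the tumor samples minus mean of the control samples."""
    return mean(profile[:n_tumor]) - mean(profile[n_tumor:])

def two_sided_p_value(difference):
    """z-test p-value for a difference in means (unit variance assumed known)."""
    return 2.0 * (1.0 - standard_normal.cdf(abs(difference) / std_error))

all_differences = [group_difference(profile) for profile in data]

# The "weak filter": keep genes with p < 0.05, with no multiple-testing correction.
selected = [d for d in all_differences if two_sided_p_value(d) < 0.05]

up = sum(1 for d in selected if d > 0)      # apparent "up-regulated" cluster
down = sum(1 for d in selected if d < 0)    # apparent "down-regulated" cluster
smallest_gap = min(abs(d) for d in selected)
```

Roughly 5% of the noise genes pass the filter, and every one of them has a group difference bounded away from zero. The survivors therefore split into an "up" group and a "down" group with an empty band in between, which a clustering method will report as two clusters even though the data contain none.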

Using clustering to find "patterns" among differentially expressed genes
Cluster analysis is preceded by a rigorous statistical analysis, for example to identify genes that were "differentially" expressed in at least one experimental comparison.
Among all these genes, some will have similar behavior across all experimental conditions.
Clustering is a way of organizing the behavior of differentially expressed genes across different experimental conditions.

Using clustering to find "patterns" among differentially expressed genes

Using clustering to find "patterns" among all genes
No filtering is performed (you can still perform "quality filtering").
The aim is to identify statistically significant patterns.
Using the best available method becomes extremely important.

Does it matter which clustering procedure we use?
Simple, commonly used method: Euclidean-distance-based hierarchical clustering.
"Complicated" method: context-specific infinite mixtures.
Data: 5685 yeast genes across two experiments (cell cycle and sporulation), with no variability-based filter.
135 genes with closest co-expression partners.

"Objective" performance assessments using KEGG as the gold standard
There is a large imbalance between the total numbers of negative and positive pairs: there are 17 times more negative pairs than positive pairs, so even a small false positive rate (FPR) can produce more false positives than true positives.

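The effect of a 17-to-1 imbalance between negative and positive pairs can be checked with a few lines of arithmetic. In this Python sketch the number of positive pairs, the true positive rate and the false positive rate are hypothetical values; only the 17-to-1 ratio comes from the slide.

```python
def expected_counts(positives, negatives_per_positive, tpr, fpr):
    """Expected true positives, false positives and precision for given rates."""
    negatives = positives * negatives_per_positive
    true_positives = tpr * positives
    false_positives = fpr * negatives
    precision = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, precision

# Hypothetical example: 1000 positive pairs, 17 times as many negative pairs,
# half of the positives detected, and a false positive rate of only 5%.
tp, fp, precision = expected_counts(positives=1000, negatives_per_positive=17,
                                    tpr=0.5, fpr=0.05)
```

With these numbers the expected false positives (850) outnumber the true positives (500), so well under half of the reported pairs are correct. In general, false positives exceed true positives whenever the FPR is larger than the TPR divided by 17.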
Summary
Using clustering alone, one can identify "significant" patterns of expression, provided appropriate methodology is used.
For example, the yeast data clustered in this example did not have any replicates, so the traditional analysis to identify differentially expressed genes before clustering is not feasible.
The statistical significance of the resulting clusters needs to be carefully examined.

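Infinite Bayesian Mixtures
Nodes of the model's DAG (figure): X, C, M, r, w, plus further nodes whose symbols were lost in extraction.
$M = (\mu_1, \ldots, \mu_K)$, $\Sigma = (\sigma_1, \ldots, \sigma_K)$, $\Pi = (\pi_1, \ldots, \pi_K)$
$C = (c_1, \ldots, c_N)$, $c_i \in \{1, \ldots, K\}$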

Conditional posterior distributions and Gibbs Sampler

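Gibbs Sampler
Result: a sequence of assignment vectors $(c_{k,1}, \ldots, c_{k,n})$, $k = 1, \ldots, k_{\max}$, such that (expression missing from the transcript).
The posterior distribution is summarized through “posterior pairwise probabilities of co-expression” $p(c_i = c_j \mid X)$.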
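
One common way to estimate the pairwise probabilities from sampler output is to count, for each pair of genes, the fraction of sampled assignment vectors in which the two genes share a cluster. The Python sketch below assumes the samples have already been collected after any burn-in; the function name and the toy samples are hypothetical.

```python
def pairwise_coexpression(samples):
    """Fraction of sampled assignment vectors in which each pair of genes shares a cluster."""
    n_genes = len(samples[0])
    n_samples = len(samples)
    prob = [[0.0] * n_genes for _ in range(n_genes)]
    for i in range(n_genes):
        for j in range(n_genes):
            together = sum(1 for c in samples if c[i] == c[j])
            prob[i][j] = together / n_samples
    return prob

# Hypothetical Gibbs output: 4 sampled assignment vectors for 3 genes.
samples = [
    [1, 1, 2],
    [1, 1, 1],
    [2, 2, 1],
    [1, 2, 2],
]
probabilities = pairwise_coexpression(samples)
```

Only agreement between labels matters, not the labels themselves, so the estimate is unaffected when the sampler renumbers clusters between iterations.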

Properties
“Pooling” information from the whole dataset by estimating both “patterns” and “assignments”, similar to K-means (K-means is in fact equivalent to a special case of a mixture model with a known number of clusters).
Does not require specifying the right number of clusters (unlike K-means).
Gives direct estimates of statistical significance (unlike anything else on the market).
Instead of debating which distance measure to use, focus on the appropriate statistical model, which is a well-defined problem.
Works for any type of data.

Finding important functional groups for up-regulated genes
Using the "Ease" annotation tool, we obtained the following significant gene ontologies: Up_DexANDNE2ANDirr_381_GO.htm
Homework:
1) Download and install Ease.
2) Select the top 20 most significantly up-regulated genes in our W-C dataset and identify significantly over-represented categories (using the three-way ANOVA analysis).
3) Repeat the analysis with 30, 40, 50 and 100 up-regulated and down-regulated genes.
4) Prepare questions for the next class regarding problems you run into.
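
As background for the homework, the notion of an "over-represented" category can be sketched with a hypergeometric tail probability. The Python code below is the plain hypergeometric test, not necessarily the score that Ease itself reports, and the gene counts in the example are hypothetical.

```python
from math import comb

def overrepresentation_p_value(population, annotated, selected, hits):
    """P(X >= hits) when `selected` genes are drawn without replacement from
    `population` genes, of which `annotated` belong to the category."""
    total = comb(population, selected)
    upper = min(annotated, selected)
    tail = sum(comb(annotated, k) * comb(population - annotated, selected - k)
               for k in range(hits, upper + 1))
    return tail / total

# Hypothetical example: 6000 genes on the array, 100 of them in the category,
# 20 genes selected as up-regulated, 5 of which fall in the category.
p_value = overrepresentation_p_value(population=6000, annotated=100,
                                     selected=20, hits=5)
```

Only about 0.33 category members would be expected among 20 randomly chosen genes here, so observing 5 gives a very small p-value. When many categories are tested at once, the resulting p-values still need a multiple-testing adjustment.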