Published by Lee Drage. Modified over 2 years ago.

1
Context-Specific Bayesian Clustering for Gene Expression Data
Yoseph Barash, Nir Friedman
School of Computer Science & Engineering, Hebrew University

2
Introduction
- New experimental methods yield an abundance of data:
  - Gene expression
  - Genomic sequences
  - Protein levels
  - …
- Data analysis methods are crucial for understanding such data
- Clustering serves as a tool for organizing the data and finding patterns in it

3
This Talk
- New method for clustering
  - Combines different types of data
  - Emphasis on learning a context-specific description of the clusters
- Application to gene expression data
  - Combines expression data with genomic information

4
The Data
- Microarray data (genes × experiments): the mRNA level of gene i in experiment j
- Binding sites (genomic data; genes × k TFs): the number of binding sites of TF j in the promoter region of gene i
Goal:
- Understand interactions between TFs and expression levels

5
Simple Clustering Model
- Attributes are independent given the cluster
- The simple model is computationally cheap
- Genes are clustered according to both expression levels and binding sites
[Figure: Cluster node with edges to attributes A1, A2, A3, …, An and TF 1, TF 2, TF 3, …, TF k]
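A minimal sketch of how a model of this kind assigns a gene to clusters, assuming Gaussian expression attributes and multinomial binding-site counts (the families named on the next slide). The function names, the dictionary layout of `model`, and all numbers are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math

def gaussian_logpdf(x, mean, var):
    # Log density of a univariate Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def cluster_posterior(expr, sites, model):
    """Return P(cluster | gene) when attributes are independent given the cluster."""
    log_joint = []
    for c in model:
        lp = math.log(c["prior"])
        # Expression attributes: one Gaussian per attribute.
        for x, (mean, var) in zip(expr, c["gauss"]):
            lp += gaussian_logpdf(x, mean, var)
        # Binding-site attributes: one multinomial over site counts per TF.
        for s, probs in zip(sites, c["multi"]):
            lp += math.log(probs[s])
        log_joint.append(lp)
    # Normalize in log space for numerical stability.
    m = max(log_joint)
    weights = [math.exp(lp - m) for lp in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical two-cluster model: two expression attributes, one TF.
model = [
    {"prior": 0.5, "gauss": [(2.0, 1.0), (0.0, 1.0)], "multi": [[0.2, 0.8]]},
    {"prior": 0.5, "gauss": [(-1.0, 1.0), (0.0, 1.0)], "multi": [[0.9, 0.1]]},
]
post = cluster_posterior([1.8, 0.1], [1], model)
```

Because the attributes factorize given the cluster, the cost per gene is linear in the number of attributes, which is what makes the simple model cheap.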

6
Local Probability Models
[Figure: Cluster node with children A1, A2, TF 1, TF 2; local models labeled Multinomial and Gaussian]

7
Structure in Local Probability Models
[Figure: Cluster node with children A1, A2, TF 1, TF 2]

8
Context Specific Independence
Benefits:
- Identifies which features characterize each cluster
- Reduces bias during learning
- A compact and efficient representation
[Figure: Cluster node with children E1, E2, TF 1, TF 2; edges annotated with the cluster sets {2,4}, {}, {1,2,4}, {1,2,3,4,5}]
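One way to read the sets on the edges, stated here as an assumption: an attribute has its own distribution only for the clusters listed in its set, and every other cluster shares a single default distribution. The sketch below uses hypothetical helper names and a five-cluster model to show why this representation is compact.

```python
def local_params(cluster, csi_set, specific, default):
    """Parameters of one attribute for a given cluster under CSI.

    Assumption: clusters in csi_set get their own parameters,
    all remaining clusters share the default.
    """
    return specific[cluster] if cluster in csi_set else default

def num_distributions(csi_set, n_clusters):
    # One distribution per cluster in the set, plus one shared default
    # if at least one cluster falls outside the set.
    return len(csi_set) + (1 if len(csi_set) < n_clusters else 0)

# The four edge sets shown on the slide, assuming five clusters in total.
edge_sets = [{2, 4}, set(), {1, 2, 4}, {1, 2, 3, 4, 5}]
counts = [num_distributions(s, 5) for s in edge_sets]
full = 5 * len(edge_sets)  # distributions needed without CSI
```

Under these assumptions the four attributes need 13 distributions in total instead of 20, and an empty set marks an attribute that does not help distinguish any cluster.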

9
Scoring CSI Cluster Models
- Represent conditional probabilities with different parametric families
  - Gaussian
  - Multinomial
  - Poisson
  - …
- Choose parameter priors from appropriate conjugate prior families
Score: [equation lost in extraction; its terms were labeled "Marginal Likelihood" and "Prior"]
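The formula on this slide was an image and did not survive. For orientation only, the standard Bayesian score that matches the two surviving labels has the form below. The symbols M (model structure), D (data), and θ (parameters) are assumptions, not taken from the slide, and the slide's exact notation may differ.

```latex
\mathrm{Score}(M : D) = \log P(M) + \log P(D \mid M),
\qquad
P(D \mid M) = \int P(D \mid M, \theta)\, P(\theta \mid M)\, d\theta
```

Here P(D | M) is the marginal likelihood, and the priors are P(M) over structures and P(θ | M) over parameters. With conjugate parameter priors and complete data, the integral has a closed form for each attribute.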

10
Learning Structure – Naive Approach
- A hard problem. The "standard" approach:
  - Learn model parameters using EM
  - Try "nearby" structures and learn parameters for each one using EM
  - Choose the best structure
- Basic problem: efficiency
[Figure: current structure with Cluster node C and children E1, E2, TF 1, TF 2, edge sets {2,4}, {1,2,3}, {}, {2}; two candidate "nearby" structures with edge sets {2,4}, {1,2,3}, {3}, {} and {}, {1,2,3}, {3}, {2}, each marked "?"]

11
Learning Structure – Structural EM
- Given complete data, we can evaluate each edge's parameters separately
- For MAP, we compute EM only once for each iteration
- Guaranteed to converge to a local optimum
Procedure:
- Learn model parameters using EM
- Soft assignment for genes
- Compute expected sufficient statistics
- Use the "completed" data to evaluate each edge separately to find the best model
[Figure: current structure with Cluster node C and children E1, E2, TF 1, TF 2, edge sets {2,4}, {1,2,3}, {}, {2}; two candidate structures with edge sets {2,4}, {3}, {}, {1,2,3} and {}, {3}, {2}, {1,2,3}, each marked "?"]
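The "completed data" idea can be sketched as follows for a single Gaussian attribute. Soft assignments weight each gene's contribution to per-cluster statistics, and once those statistics exist, candidate edge sets can be compared by pooling them rather than rerunning EM. Function names and the toy numbers are hypothetical; this is an illustration, not the authors' code.

```python
def expected_gaussian_stats(values, responsibilities, n_clusters):
    """Expected sufficient statistics [weight, sum x, sum x^2] per cluster."""
    stats = [[0.0, 0.0, 0.0] for _ in range(n_clusters)]
    for x, resp in zip(values, responsibilities):
        for c, r in enumerate(resp):
            stats[c][0] += r          # expected number of genes in cluster c
            stats[c][1] += r * x      # expected sum of values
            stats[c][2] += r * x * x  # expected sum of squares
    return stats

def pooled_mean(stats, clusters):
    """Mean of the shared distribution for clusters pooled into one group."""
    n = sum(stats[c][0] for c in clusters)
    s = sum(stats[c][1] for c in clusters)
    return s / n

# Three genes, two clusters; each row is a soft assignment P(cluster | gene).
values = [1.0, 3.0, 5.0]
responsibilities = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
stats = expected_gaussian_stats(values, responsibilities, 2)

mean_c0 = pooled_mean(stats, [0])         # attribute depends on cluster 0
mean_shared = pooled_mean(stats, [0, 1])  # attribute ignores the cluster
```

Because the statistics are additive, moving a cluster into or out of an attribute's set only requires re-summing a few numbers for that attribute, which is why each edge can be evaluated separately.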

12
Results on Synthetic Data
Basic approach:
- Generate data from a known structure
- Evaluate learned structures for different sample sizes (200–800)
- Add "noise" of unrelated samples (10–30%) to the training set to simulate genes that do not fall into "nice" functional categories
- Test the learned model for structure, as well as for correlation between its tagging and the one given by the original model
Main results:
- Cluster number: models with fewer clusters were sharply penalized. Models with 1–2 additional clusters often got a similar score, with "degenerate" clusters to which none of the real samples were classified.
- Structure accuracy: very few false negative edges; 10–20% false positive edges (score dependent)
- Mutual information ratio: max for 800 samples, 95–100% for 500 samples, and ~90% for 200 samples
- Learned clusters were very informative
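The slide does not define the mutual information ratio. A plausible reading, stated here as an assumption, is the mutual information between the original and learned cluster labels divided by the entropy of the original labels, so that 100% means the learned tagging recovers the original one up to renaming. A minimal sketch with hypothetical function names:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())

def mutual_information(a, b):
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum((k / n) * math.log(k * n / (pa[x] * pb[y]))
               for (x, y), k in pab.items())

def mi_ratio(true_labels, learned_labels):
    # Assumed definition: I(true; learned) / H(true).
    # Undefined when all true labels are identical (zero entropy).
    return mutual_information(true_labels, learned_labels) / entropy(true_labels)

true_labels = [0, 0, 1, 1]
renamed = mi_ratio(true_labels, [1, 1, 0, 0])    # same partition, renamed
unrelated = mi_ratio(true_labels, [0, 1, 0, 1])  # independent partition
```

The ratio ignores cluster names, which matters because a learned model has no reason to number its clusters the same way as the generating model.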

13
Yeast Stress Data (Gasch et al. 2001)
- Examines the response of yeast to stress situations
- Total of 93 arrays
- We selected ~900 genes that changed in a selective manner
Treatment steps:
- Initial clustering
- Found putative binding sites based on clusters
- Re-clustered with these sites

14
Stress Data -- CSI Clusters

15
CSI Clusters
[Figure: mean expression level (axis ticks -2 to 4) across conditions, labeled: HSF HSF variable diamide H2O2 Menadione DDT sorbitol Nitrogen Dep. Diauxic shift YPD Starvation YP Steady]

16
Promoters Analysis: Cluster 3
- MIG1: CCCCGC, CGGACC, ACCCCG
- GAL4: CGGGCC
- Others: CCAATCA
[Figure: mean expression level (axis ticks -2 to 4) across conditions, labeled: HSF HSF variable diamide H2O2 Menadione DDT sorbitol Nitrogen Dep. Diauxic shift YPD Starvation YP Steady]

17
Promoters Analysis: Cluster 7
- GCN4: TGACTCA
- Others: CGGAAAA, ACTGTGG
[Figure: mean expression level (axis ticks -2 to 4) across conditions, labeled: HSF HSF variable diamide H2O2 Menadione DDT sorbitol Nitrogen Dep. Diauxic shift YPD Starvation YP Steady]

18
Discussion
Goals:
- Identify binding sites/transcription factors
- Understand interactions among transcription factors
  - "Combinatorial effects" on expression
- Predict role/function of the genes
Methods:
- Integration of a model of statistical patterns of binding sites (see Holmes & Bruno, ISMB'00)
- Additional dependencies among attributes
  - Tree-augmented Naive Bayes
  - Probabilistic Relational Models (see poster)
