Published by Lee Drage. Modified over 2 years ago.

1
Context-Specific Bayesian Clustering for Gene Expression Data
Yoseph Barash, Nir Friedman
School of Computer Science & Engineering, Hebrew University

2
Introduction
- New experimental methods yield an abundance of data
  - Gene expression
  - Genomic sequences
  - Protein levels
  - …
- Data analysis methods are crucial for understanding such data
- Clustering serves as a tool for organizing the data and finding patterns in it

3
This Talk
- New method for clustering
  - Combines different types of data
  - Emphasis on learning a context-specific description of the clusters
- Application to gene expression data
  - Combines expression data with genomic information

4
The Data
- Microarray data (genes × experiments): entry (i, j) is the mRNA level of gene i in experiment j
- Genomic data (genes × binding sites): entry (i, j) is the number of binding sites of TF j in the promoter region of gene i
Goal:
- Understand interactions between TFs and expression levels
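The binding-site matrix described on this slide can be sketched in a few lines. This is only an illustration, not the authors' pipeline: the promoter strings, the gene names, and the `count_sites` helper are hypothetical, and exact string matching on a single strand is a simplifying assumption. The motifs TGACTCA and CCCCGC are the ones listed on the promoter-analysis slides.

```python
def count_sites(promoter, motif):
    """Count (possibly overlapping) exact occurrences of motif in promoter."""
    count = 0
    start = promoter.find(motif)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are also counted.
        start = promoter.find(motif, start + 1)
    return count


# Hypothetical promoter sequences, for illustration only.
promoters = {
    "geneA": "AATGACTCAGGTGACTCA",
    "geneB": "CCCCGCAAT",
}
# One example motif per TF, taken from the promoter-analysis slides.
motifs = {"GCN4": "TGACTCA", "MIG1": "CCCCGC"}

# binding[gene][tf] = number of binding sites of tf in the promoter of gene
binding = {
    gene: {tf: count_sites(seq, motif) for tf, motif in motifs.items()}
    for gene, seq in promoters.items()
}
```

Each row of `binding` is the genomic half of a gene's attribute vector; the expression half would be that gene's row in the microarray matrix.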

5
Simple Clustering Model
- Attributes are independent given the cluster
- The simple model is computationally cheap
- Genes are clustered according to both expression levels and binding sites
[Figure: Cluster node with child attribute nodes A1, A2, A3, …, An and TF 1, TF 2, TF 3, …, TF k]
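Because the attributes are independent given the cluster, the posterior over clusters for one gene is the cluster prior times a product of per-attribute terms. The sketch below illustrates this under stated assumptions: expression attributes are modeled as Gaussians and binding-site attributes as discrete (multinomial) variables, and all parameter values, along with the `cluster_posterior` function, are made up for the example.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def cluster_posterior(expr, sites, clusters):
    """P(cluster | gene) when attributes are independent given the cluster.

    expr:  list of expression values (assumed Gaussian per attribute)
    sites: list of discrete binding-site values (assumed multinomial)
    """
    log_joint = []
    for c in clusters:
        lp = math.log(c["prior"])
        for x, (mean, var) in zip(expr, c["gauss"]):
            lp += gaussian_logpdf(x, mean, var)
        for s, probs in zip(sites, c["multi"]):
            lp += math.log(probs[s])
        log_joint.append(lp)
    # Normalize in log space for numerical stability.
    m = max(log_joint)
    weights = [math.exp(lp - m) for lp in log_joint]
    z = sum(weights)
    return [w / z for w in weights]


# Two hypothetical clusters: two expression attributes, one binding-site attribute.
clusters = [
    {"prior": 0.5, "gauss": [(2.0, 1.0), (0.0, 1.0)], "multi": [[0.2, 0.8]]},
    {"prior": 0.5, "gauss": [(-1.0, 1.0), (0.0, 1.0)], "multi": [[0.9, 0.1]]},
]
posterior = cluster_posterior(expr=[1.8, 0.1], sites=[1], clusters=clusters)
```

The cost of scoring one gene is linear in the number of attributes and clusters, which is what makes the simple model cheap.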

6
Local Probability Models
[Figure: Cluster node with children A1, A2, TF 1, TF 2; the local probability models are labeled Multinomial and Gaussian]

7
Structure in Local Probability Models
[Figure: Cluster node with children A1, A2, TF 1, TF 2]

8
Context Specific Independence
[Figure: Cluster node with children E1, E2, TF 1, TF 2; the edges are annotated with cluster subsets {2,4}, {}, {1,2,4}, {1,2,3,4,5}]
Benefits:
- Identifies what features characterize each cluster
- Reduces bias during learning
- A compact and efficient representation
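One way to read the edge annotations is sketched below. The assumption, which the slide does not state explicitly, is that the set on an edge lists the clusters in which the attribute has its own distribution, while all remaining clusters share a single default distribution. The `CSIAttribute` class and the parameter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CSIAttribute:
    """Gaussian attribute with context-specific dependence on the cluster."""
    name: str
    specific: dict   # cluster id -> (mean, var) for clusters on the edge set
    default: tuple   # (mean, var) shared by every other cluster

    def params_for(self, cluster):
        # Clusters outside the edge set fall back to the shared default.
        return self.specific.get(cluster, self.default)

    def n_distributions(self, n_clusters):
        # The default is only needed if some cluster is not listed.
        uses_default = len(self.specific) < n_clusters
        return len(self.specific) + (1 if uses_default else 0)


# Edge annotated {2,4} in a 5-cluster model: 3 distributions instead of 5.
e1 = CSIAttribute("E1", {2: (1.5, 0.5), 4: (-2.0, 0.7)}, default=(0.0, 1.0))
# Edge annotated {}: the attribute is independent of the cluster.
tf1 = CSIAttribute("TF 1", {}, default=(0.0, 1.0))
```

Under this reading, the edge set directly names the clusters that an attribute helps characterize, and fewer distributions have to be estimated from the data.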

9
Scoring CSI Cluster Models
- Represent conditional probabilities with different parametric families
  - Gaussian
  - Multinomial
  - Poisson
  - …
- Choose parameter priors from the appropriate conjugate prior families
Score: [equation not recovered; its two terms are labeled "Marginal Likelihood" and "Prior"]
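A common form of a Bayesian score with these two terms is log P(D | M) + log P(M); whether this is the exact form used on the slide is an assumption. The sketch below shows why conjugate priors matter: for a multinomial attribute with a Dirichlet prior, the marginal likelihood has a closed form. The function names, the counts, and the flat structure prior are illustrative choices.

```python
import math


def log_marginal_multinomial(counts, alphas):
    """Closed-form log P(D) for multinomial data under a Dirichlet(alphas) prior."""
    a0 = sum(alphas)
    n = sum(counts)
    out = math.lgamma(a0) - math.lgamma(a0 + n)
    for c, a in zip(counts, alphas):
        out += math.lgamma(a + c) - math.lgamma(a)
    return out


def bayesian_score(counts_per_group, alphas, log_structure_prior):
    """log prior + sum of log marginal likelihoods, one per parameter group."""
    return log_structure_prior + sum(
        log_marginal_multinomial(counts, alphas) for counts in counts_per_group
    )


alphas = [1.0, 1.0]
# Hypothetical counts of a binary attribute in two clusters.
cluster_a, cluster_b = [9, 1], [1, 9]

# Structure 1: each cluster has its own distribution for this attribute.
separate = bayesian_score([cluster_a, cluster_b], alphas, log_structure_prior=0.0)
# Structure 2: both clusters share one distribution (counts are pooled).
pooled = [a + b for a, b in zip(cluster_a, cluster_b)]
shared = bayesian_score([pooled], alphas, log_structure_prior=0.0)
```

With these counts the separate structure scores higher, because the two clusters clearly differ on the attribute; with similar counts the shared structure would win.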

10
Learning Structure – Naive Approach
- A hard problem
"Standard" approach:
- Learn model parameters using EM
- Try "nearby" structures and learn parameters for each one using EM
- Choose the best structure
Basic problem: efficiency
[Figure: a current model (C with children E1, E2, TF 1, TF 2 and edge sets {2,4}, {1,2,3}, {}, {2}) and two candidate neighboring structures, each marked "?", with one edge set changed]

11
Learning Structure – Structural EM
- Given complete data, we can evaluate each edge's parameters separately
- For MAP, we compute EM only once for each iteration
- Guaranteed to converge to a local optimum
Loop shown in the figure:
- Learn model parameters using EM
- Soft assignment for genes: compute expected sufficient statistics
- Use the "completed" data to evaluate each edge separately to find the best model
[Figure: a current model (C with children E1, E2, TF 1, TF 2 and edge sets {2,4}, {1,2,3}, {}, {2}) and candidate structures, each marked "?", evaluated against the completed data]
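The "completed" data can be pictured as per-cluster expected sufficient statistics computed from the soft assignments. The sketch below assumes a single Gaussian attribute and made-up responsibilities; `expected_sufficient_stats` and `pool` are hypothetical helpers, not the authors' code.

```python
def expected_sufficient_stats(values, responsibilities, n_clusters):
    """Per-cluster expected count, sum, and sum of squares for one attribute.

    responsibilities[g][c] is the soft assignment P(cluster c | gene g).
    """
    stats = [{"n": 0.0, "sum": 0.0, "sumsq": 0.0} for _ in range(n_clusters)]
    for x, resp in zip(values, responsibilities):
        for c in range(n_clusters):
            stats[c]["n"] += resp[c]
            stats[c]["sum"] += resp[c] * x
            stats[c]["sumsq"] += resp[c] * x * x
    return stats


def pool(stats, cluster_ids):
    """Statistics for a group of clusters that share one distribution."""
    return {
        key: sum(stats[c][key] for c in cluster_ids)
        for key in ("n", "sum", "sumsq")
    }


values = [1.0, 3.0]
responsibilities = [[1.0, 0.0], [0.5, 0.5]]  # made-up soft assignments
stats = expected_sufficient_stats(values, responsibilities, n_clusters=2)
shared = pool(stats, [0, 1])
```

Because pooled statistics are just sums of per-cluster statistics, every candidate edge set for an attribute can be scored from the same table without rerunning EM, which is the efficiency gain over the naive approach.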

12
Results on Synthetic Data
Basic approach:
- Generate data from a known structure
- Evaluate learned structures for different sample sizes (200–800)
- Add "noise" of unrelated samples (10–30%) to the training set to simulate genes that do not fall into "nice" functional categories
- Test the learned model for structure, as well as for correlation between its tagging and the one given by the original model
Main results:
- Cluster number: models with fewer clusters were sharply penalized. Models with 1–2 additional clusters often got a similar score, with "degenerate" clusters to which none of the real samples were classified.
- Structure accuracy: very few false negative edges; 10–20% false positive edges (score dependent)
- Mutual information ratio: maximal for 800 samples, 95–100% for 500 samples, and about 90% for 200 samples
- Learned clusters were very informative
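The slide does not define the mutual information ratio. One plausible reading, used here purely as an assumption, is the mutual information between the learned tagging and the original tagging, divided by the entropy of the original tagging, so that a perfect recovery (up to renaming clusters) gives 1.0.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same samples."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (nab / n) * math.log((nab * n) / (pa[x] * pb[y]))
        for (x, y), nab in pab.items()
    )


def entropy(a):
    """Entropy (in nats) of a labeling."""
    n = len(a)
    return -sum((c / n) * math.log(c / n) for c in Counter(a).values())


def mi_ratio(original, learned):
    # Assumed normalization: divide by the entropy of the original tagging.
    return mutual_information(original, learned) / entropy(original)


original = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]   # same partition, cluster names swapped
collapsed = [0, 0, 0, 0]   # everything placed in one cluster
```

The measure is invariant to how clusters are numbered, which matters because a learned model has no reason to use the same cluster indices as the generating model.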

13
Yeast Stress Data (Gasch et al. 2001)
- Examines the response of yeast to stress situations
- 93 arrays in total
- We selected ~900 genes that changed in a selective manner
Treatment steps:
- Initial clustering
- Found putative binding sites based on the clusters
- Re-clustered with these sites

14
Stress Data -- CSI Clusters

15
CSI Clusters
[Figure: mean expression level per cluster (axis from -2 to 4) across the conditions HSF, HSF variable, diamide, H2O2, Menadione, DDT, sorbitol, Nitrogen Dep., Diauxic shift, YPD, Starvation, YP, Steady]

16
Promoters Analysis
Cluster 3
- MIG1: CCCCGC, CGGACC, ACCCCG
- GAL4: CGGGCC
- Others: CCAATCA
[Figure: mean expression level of the cluster across the stress conditions]

17
Promoters Analysis
Cluster 7
- GCN4: TGACTCA
- Others: CGGAAAA, ACTGTGG
[Figure: mean expression level of the cluster across the stress conditions]

18
Discussion
Goals:
- Identify binding sites/transcription factors
- Understand interactions among transcription factors
  - "Combinatorial effects" on expression
- Predict the role/function of the genes
Methods:
- Integration of a model of statistical patterns of binding sites (see Holmes & Bruno, ISMB'00)
- Additional dependencies among attributes
  - Tree-augmented Naive Bayes
  - Probabilistic Relational Models (see poster)
