Context-Specific Bayesian Clustering for Gene Expression Data
Yoseph Barash, Nir Friedman
School of Computer Science & Engineering, Hebrew University


Introduction
- New experimental methods yield an abundance of data:
  - Gene expression
  - Genomic sequences
  - Protein levels
  - …
- Data analysis methods are crucial for understanding such data
- Clustering serves as a tool for organizing the data and finding patterns in it

This Talk
- A new method for clustering:
  - Combines different types of data
  - Emphasis on learning a context-specific description of the clusters
- Application to gene expression data:
  - Combines expression data with genomic information

The Data
- Microarray data: a genes-by-experiments matrix; entry (i, j) is the mRNA level of gene i in experiment j
- Genomic data: a genes-by-TFs matrix; entry (i, k) is the number of binding sites of transcription factor (TF) k in the promoter region of gene i
- Goal: understand the interactions between TFs and expression levels

Simple Clustering Model
- Attributes are independent given the cluster (a naive Bayes structure: Cluster -> A1, A2, ..., An, TF1, TF2, ..., TFk)
- A simple model is computationally cheap
- Genes are clustered according to both expression levels and binding sites
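To make the model concrete, here is a minimal plain-Python sketch of EM for a naive Bayes mixture with Gaussian attributes (assumed for illustration, not the authors' implementation; the function names and parameters are mine):

```python
# Sketch of a naive-Bayes mixture: attributes are conditionally
# independent Gaussians given the hidden cluster variable.
import math, random

def gauss_logpdf(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def em_cluster(data, k, iters=50, seed=0):
    """Soft-cluster rows of `data` (lists of floats) into k clusters via EM."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    # Start from random soft assignments (responsibilities)
    resp = [[rng.random() for _ in range(k)] for _ in range(n)]
    for r in resp:
        s = sum(r); r[:] = [v / s for v in r]
    for _ in range(iters):
        # M-step: per-cluster weight, mean, and variance for each attribute
        weights, mus, vars_ = [], [], []
        for c in range(k):
            w = max(sum(resp[i][c] for i in range(n)), 1e-9)
            weights.append(w / n)
            mu = [sum(resp[i][c] * data[i][j] for i in range(n)) / w for j in range(d)]
            var = [max(sum(resp[i][c] * (data[i][j] - mu[j]) ** 2 for i in range(n)) / w, 1e-3)
                   for j in range(d)]
            mus.append(mu); vars_.append(var)
        # E-step: responsibilities from the product of per-attribute likelihoods
        for i in range(n):
            logs = [math.log(weights[c]) + sum(gauss_logpdf(data[i][j], mus[c][j], vars_[c][j])
                                               for j in range(d)) for c in range(k)]
            m = max(logs)
            probs = [math.exp(v - m) for v in logs]
            s = sum(probs)
            resp[i] = [p / s for p in probs]
    return resp
```

With CSI added (later slides), each attribute would keep cluster-specific parameters only for a subset of cluster values; in this plain mixture every attribute depends on every cluster.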

Local Probability Models
Each attribute has its own parametric family conditioned on the cluster: Gaussian for the continuous expression attributes (A1, A2), multinomial for the discrete binding-site counts (TF1, TF2).
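A sketch (function names assumed, not from the paper) of how heterogeneous local models combine: given a cluster, each attribute contributes the log-density of its own family, Gaussian for expression values and multinomial for binding-site counts:

```python
# Mixed local probability models: the per-gene log-likelihood given one
# cluster is a sum of independent per-attribute terms from different families.
import math

def log_gaussian(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def log_multinomial(count_index, probs):
    # probs[c] is the cluster's probability of seeing count value c
    return math.log(probs[count_index])

def gene_loglik(expr, sites, expr_params, site_params):
    """log P(gene | cluster): Gaussian terms for expression attributes
    plus multinomial terms for binding-site counts."""
    ll = sum(log_gaussian(x, mu, var) for x, (mu, var) in zip(expr, expr_params))
    ll += sum(log_multinomial(c, p) for c, p in zip(sites, site_params))
    return ll
```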

Structure in Local Probability Models
[Figure: the network Cluster -> A1, A2, TF1, TF2 with its local probability models]

Context-Specific Independence
[Figure: each attribute depends on the cluster variable only for a subset of cluster values, e.g. E1 for clusters {2,4}, E2 for {}, TF1 for {1,2,4}, TF2 for {1,2,3,4,5}]
Benefits:
- Identifies which features characterize each cluster
- Reduces bias during learning
- A compact and efficient representation
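A minimal sketch of a CSI local model (class name and representation assumed): an attribute stores distinct parameters only for the clusters in its local-structure set and falls back to one shared default otherwise:

```python
# Hypothetical CSI local model: cluster-specific parameters only for a
# subset of cluster values; all other clusters share a default distribution.
class CSILocalModel:
    def __init__(self, specific, default):
        self.specific = specific  # e.g. {2: (mu, var), 4: (mu, var)}
        self.default = default    # shared (mu, var) for clusters outside the set

    def params(self, cluster):
        # Context-specific lookup: fall back to the shared default
        return self.specific.get(cluster, self.default)
```

An attribute like E2 above, whose set is {}, stores no specific entries at all: it is independent of the cluster, which is what makes the representation compact.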

Scoring CSI Cluster Models
- Represent conditional probabilities with different parametric families: Gaussian, multinomial, Poisson, …
- Choose parameter priors from the appropriate conjugate prior families
- Score: Score(M : D) = log P(D | M) + log P(M), where P(D | M) is the marginal likelihood (with the parameters integrated out) and P(M) is the structure prior
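For the multinomial/Dirichlet conjugate pair, the marginal likelihood has a closed form in terms of Gamma functions; a sketch (illustrative, not the paper's code):

```python
# Marginal likelihood of multinomial counts under a Dirichlet prior:
# log P(D) = log Gamma(sum a) - log Gamma(sum a + N)
#          + sum_i [log Gamma(a_i + n_i) - log Gamma(a_i)]
import math

def log_marginal_multinomial(counts, alpha):
    """counts: observed counts n_i; alpha: Dirichlet hyperparameters a_i."""
    a_sum, n_sum = sum(alpha), sum(counts)
    out = math.lgamma(a_sum) - math.lgamma(a_sum + n_sum)
    for n_i, a_i in zip(counts, alpha):
        out += math.lgamma(a_i + n_i) - math.lgamma(a_i)
    return out
```

Because the integral is available in closed form, scoring a candidate structure needs only the sufficient statistics (the counts), not an inner optimization.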

Learning Structure – Naive Approach
A hard problem. The "standard" approach:
- Learn the model parameters using EM
- Try "nearby" structures, learning the parameters for each candidate with a full EM run, and choose the best structure
The basic problem is efficiency: every candidate structure requires its own EM run.

Learning Structure – Structural EM
- Learn the model parameters using EM (soft assignment of genes to clusters)
- Compute expected sufficient statistics from the soft assignments
- Use the "completed" data to evaluate each edge separately and find the best model: given complete data, each edge's parameters can be evaluated independently
- For a MAP model, EM is computed only once per iteration
- Guaranteed to converge to a local optimum
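One edge decision inside structural EM can be sketched as follows (an illustrative simplification: expected log-likelihood with a fixed penalty stands in for the full Bayesian score, and all names are assumed):

```python
# Structural-EM edge decision for one Gaussian attribute: from the soft
# assignments, compare cluster-specific parameters vs. one shared Gaussian.
import math

def expected_stats(values, resp, k):
    """Expected sufficient statistics (weight, sum, sum of squares) per cluster."""
    stats = [[0.0, 0.0, 0.0] for _ in range(k)]
    for x, r in zip(values, resp):
        for c in range(k):
            stats[c][0] += r[c]
            stats[c][1] += r[c] * x
            stats[c][2] += r[c] * x * x
    return stats

def edge_score(stats):
    """Expected log-likelihood when each cluster keeps its own Gaussian."""
    score = 0.0
    for w, s, s2 in stats:
        if w < 1e-9:
            continue
        mu = s / w
        var = max(s2 / w - mu * mu, 1e-6)
        # For MLE parameters the quadratic term sums to w, giving this form
        score += -0.5 * w * (math.log(2 * math.pi * var) + 1.0)
    return score

def keep_edge(values, resp, k, penalty):
    """Keep Cluster -> attribute iff the split score beats shared + penalty."""
    split = edge_score(expected_stats(values, resp, k))
    shared = edge_score(expected_stats(values, [[1.0]] * len(values), 1))
    return split - shared > penalty
```

The point of the trick is visible here: once the expected statistics are in hand, each edge is scored from cached counts, so one E-step serves every candidate change.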

Results on Synthetic Data
Basic approach:
- Generate data from a known structure
- Evaluate the learned structures for different sample sizes (200–800)
- Add "noise" of unrelated samples (10–30%) to the training set, simulating genes that do not fall into "nice" functional categories
- Test the learned model for structure recovery as well as for correlation between its cluster assignments and those of the original model
Main results:
- Cluster number: models with too few clusters were sharply penalized; models with 1–2 additional clusters often scored similarly, with "degenerate" clusters to which none of the real samples were assigned
- Structure accuracy: very few false-negative edges, 10–20% false-positive edges (score dependent)
- Mutual information ratio: maximal for 800 samples, % for 500, and ~90% for 200 samples; the learned clusters were highly informative

Yeast Stress Data (Gasch et al. 2001)
- Examines the response of yeast to stress conditions
- 93 arrays in total
- We selected ~900 genes that changed in a selective manner
Treatment steps:
- Initial clustering
- Found putative binding sites based on the clusters
- Re-clustered with these sites

Stress Data -- CSI Clusters

CSI Clusters
[Figure: mean expression level of each cluster across the conditions: HSF, HSF variable, diamide, H2O2, menadione, DTT, sorbitol, nitrogen depletion, diauxic shift, YPD, starvation, YP steady]

Promoter Analysis – Cluster 3
- MIG1: CCCCGC, CGGACC, ACCCCG
- GAL4: CGGGCC
- Others: CCAATCA
[Figure: mean expression profile of cluster 3 across the stress conditions]

Promoter Analysis – Cluster 7
- GCN4: TGACTCA
- Others: CGGAAAA, ACTGTGG
[Figure: mean expression profile of cluster 7 across the stress conditions]

Discussion
Goals:
- Identify binding sites / transcription factors
- Understand interactions among transcription factors ("combinatorial effects" on expression)
- Predict the role/function of genes
Methods:
- Integration of a model of the statistical patterns of binding sites (see Holmes & Bruno, ISMB '00)
- Additional dependencies among attributes: tree-augmented naive Bayes, probabilistic relational models (see poster)