
Gibbs biclustering of microarray data
Yves Moreau & Qizheng Sheng
Katholieke Universiteit Leuven, ESAT-SCD (SISTA)
On leave at the Center for Biological Sequence Analysis, Danish Technical University

June 26, 2015, CBS Microarray Course

Clustering
- Form coherent groups of
  - Genes
  - Patient samples (e.g., tumors)
  - Drug or toxin response
- Study these groups to get insight into biological processes
  - Diagnostic and prognostic classes
  - Genes in the same cluster can have the same function or the same regulation
- Clustering algorithms
  - Hierarchical clustering
  - K-means
  - Self-organizing maps
  - ...

What’s wrong with clustering?
- Clustering is a long-solved problem?!
- Many problems with current clustering algorithms
  - PCA does not do any form of grouping
  - Hierarchical clustering does not produce distinct groups
    - Only a tree; it is then up to the user to pick nodes from the tree
  - K-means does not tell you how many clusters are really present in the data
  - ...

A wish list for clustering
- We expect a lot from a clustering algorithm
- Fast and not memory hungry
  - Can run easily on a large microarray data set ([...] genes, >100 experiments)
- Partitions genes into distinct groups and automatically determines the “right” number of groups
- Robust
  - If you remove some genes and some experiments, you want to obtain roughly the same groups
- Rejection of outliers (genes that do not clearly belong to any group)
- Probabilistic cluster membership
  - One gene can belong to several clusters
- Takes biological knowledge into account
  - Maybe you want some known genes to cluster together
- Meaning of the clusters?
- Heterogeneous microarray data sources

Biclustering microarray data

From genome projects to transcriptome projects
- Microarray cost per expression measurement
- Budgets and expertise
- Publicly available microarray data
- Need for exchange standards & repositories
- Big consortia set up big microarray projects
  - Genome projects → “transcriptome” projects (= compendia)
- Change in microarray projects (cf. sequence analysis)
  - Analyze public data first to generate a hypothesis
  - Design and perform your own microarray experiment

Why biclustering?
- Data becomes more heterogeneous
- Gene clustering
  - Group genes that behave similarly over all conditions
- Gene biclustering
  - Group genes that behave similarly over a subset of conditions
  - “Feature selection”
  - More suitable for a heterogeneous compendium

Graphical models
- Probabilistic graphical models
  - Biostatistics
    - Bayesian stats
    - Clustering
    - Decision support
  - Genetics
    - Linkage analysis
    - Phylogeny
  - Sequence analysis
    - Modeling protein families
    - Gene prediction
    - Regulatory sequence analysis
  - Expression analysis
    - Clustering
    - Genetic network inference

Discretizing microarray data
- Microarray data is continuous
- Discretize by equal frequency
[Figure: distribution of expression values for a given gene, split into low, medium, and high; discretized microarray data set (genes by conditions) containing a bicluster]
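Equal-frequency discretization of one gene's profile can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `discretize_equal_frequency`, the choice of three levels (low, medium, high, as in the figure), and the toy profile are assumptions.

```python
def discretize_equal_frequency(values, n_bins=3):
    """Map each value of one gene's profile to a level 0..n_bins-1 (0 = low)
    so that the levels hold as equal a number of conditions as possible."""
    n = len(values)
    # Rank the conditions by expression value; Python's sort is stable,
    # so tied values keep their original order (an arbitrary tie rule).
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // n
    return labels


# Hypothetical expression profile of one gene over six conditions.
profile = [0.1, 2.3, -1.2, 0.8, 1.5, -0.4]
levels = discretize_equal_frequency(profile)
print(levels)
```

Applying the function gene by gene gives every gene the same number of low, medium, and high entries, which is what makes a uniform background a natural reference.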

Bicluster
[Figure: a bicluster in the discretized data set]

Likelihood
[Figure: likelihood scale from 0 to 1; legend: background, pattern]

Likelihood
[Figure: matrix of per-cell probabilities, mostly .9 with a few .05]

Likelihood
- Get the right genes
[Figure: matrix of per-cell probabilities, mostly .05 with a few .9]

Likelihood
- Get the right conditions
[Figure: matrix of per-cell probabilities mixing .9 and .05]

Likelihood
- Get the right frequency pattern
[Figure: matrix of per-cell probabilities with values .6 and .2]

Optimizing the bicluster
- Find the right bicluster
  - Genes
  - Conditions
  - Pattern
- For a given choice of genes and conditions, the “best” pattern is given by the frequencies found in the extracted pattern
  - No more need to optimize over the pattern
- Maximum likelihood: find genes and conditions that maximize [...]
- Gibbs sampling: find genes and conditions that optimize [...]
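The idea that the pattern follows from the chosen genes and conditions can be sketched in code. The model below is an assumption made for illustration: cells inside the bicluster follow a per-condition frequency pattern, all other cells follow a fixed background distribution, and a pseudocount (not mentioned on the slide) smooths the frequencies so that no logarithm of zero occurs. The names `estimate_pattern` and `log_likelihood` are hypothetical.

```python
import math
from collections import Counter


def estimate_pattern(data, genes, conditions, n_levels=3, pseudocount=1.0):
    """Per-condition level frequencies among the selected genes (smoothed)."""
    pattern = {}
    total = len(genes) + pseudocount * n_levels
    for c in conditions:
        counts = Counter(data[g][c] for g in genes)
        pattern[c] = [(counts[k] + pseudocount) / total for k in range(n_levels)]
    return pattern


def log_likelihood(data, genes, conditions, background, n_levels=3):
    """Log-likelihood of the whole matrix for one choice of genes and conditions."""
    pattern = estimate_pattern(data, genes, conditions, n_levels)
    gene_set, cond_set = set(genes), set(conditions)
    ll = 0.0
    for g, row in enumerate(data):
        for c, level in enumerate(row):
            if g in gene_set and c in cond_set:
                ll += math.log(pattern[c][level])   # cell explained by the pattern
            else:
                ll += math.log(background[level])   # cell explained by the background
    return ll


# Toy discretized data (rows = genes, columns = conditions, levels 0..2).
data = [
    [2, 0, 1, 2],
    [2, 0, 0, 1],
    [2, 0, 2, 0],
    [1, 2, 0, 1],
]
background = [1.0 / 3] * 3
ll_good = log_likelihood(data, [0, 1, 2], [0, 1], background)  # consistent block
ll_bad = log_likelihood(data, [0, 1, 3], [2, 3], background)   # inconsistent block
ll_none = log_likelihood(data, [], [], background)             # no bicluster at all
print(ll_good, ll_bad, ll_none)
```

Because the pattern is recomputed from the selected cells, the search only has to range over genes and conditions, as the slide states.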

Gibbs sampling

Markov chain Monte Carlo
- Markov chain with transition matrix T
[Figure: transition matrix T over the states A, C, G, T; chain states X = A, X = C, X = G, X = T]

Markov chain Monte Carlo
- Markov chains can sample from complex distributions
ACGCGGTGTGCGTTTGACGA
ACGGTTACGCGACGTTTGGT
ACGTGCGGTGTACGTGTACG
ACGGAGTTTGCGGGACGCGT
ACGCGCGTGACGTACGCGTG
AGACGCGTGCGCGCGGACGC
ACGGGCGTGCGCGCGTCGCG
AACGCGTTTGTGTTCGGTGC
ACCGCGTTTGACGTCGGTTC
ACGTGACGCGTAGTTCGACG
ACGTGACACGGACGTACGCG
ACCGTACTCGCGTTGACACG
ATACGGCGCGGCGGGCGCGG
ACGTACGCGTACACGCGGGA
ACGCGCGTGTTTACGACGTG
ACGTCGCACGCGTCGGTGTG
ACGGCGGTCGGTACACGTCG
ACGTTGCGACGTGCGTGCTG
ACGGAACGACGACGCGACGC
ACGGCGTGTTCGCGGTGCGG
[Figure: percentage of A, C, G, T at each position]
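A Markov chain over the four nucleotides can be simulated in a few lines. The transition probabilities below are invented for illustration (the slide's matrix T is not preserved in the transcript); the sketch only shows that the long-run state frequencies of the simulated chain approach the chain's stationary distribution.

```python
import random

STATES = "ACGT"
# Hypothetical transition matrix: T[current][next]; each row sums to 1.
T = {
    "A": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "C": {"A": 0.2, "C": 0.2, "G": 0.5, "T": 0.1},
    "G": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "T": {"A": 0.3, "C": 0.3, "G": 0.3, "T": 0.1},
}


def simulate(n_steps, start="A", seed=0):
    """Run the chain for n_steps and return the visited states as a string."""
    rng = random.Random(seed)
    state, path = start, []
    for _ in range(n_steps):
        weights = [T[state][s] for s in STATES]
        state = rng.choices(STATES, weights=weights)[0]
        path.append(state)
    return "".join(path)


def stationary(n_iter=200):
    """Approximate the stationary distribution by repeated application of T."""
    pi = {s: 0.25 for s in STATES}
    for _ in range(n_iter):
        pi = {s: sum(pi[r] * T[r][s] for r in STATES) for s in STATES}
    return pi


seq = simulate(20000)
empirical = {s: seq.count(s) / len(seq) for s in STATES}
pi = stationary()
for s in STATES:
    print(s, round(empirical[s], 3), round(pi[s], 3))
```

The same principle underlies MCMC: design a chain whose stationary distribution is the target, run it, and use the visited states as samples.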

Gibbs sampling
- Markov chain for Gibbs sampling

Gibbs sampling
- True target distribution (2D normal N(μ, Σ))

Gibbs sampling
- First 20 Gibbs sampling iterates (conditionals are 1D normals)

Gibbs sampling
- Burn-in samples (1000 samples)

Gibbs sampling
- Samples after Markov chain convergence (samples [...])
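The 2D normal example can be reproduced with a short Gibbs sampler. The sketch assumes a standardized target (zero means, unit variances, correlation `rho`), for which each conditional is the 1D normal N(rho * other, 1 - rho^2); the value rho = 0.8, the starting point, and the sample counts other than the 1000 burn-in samples are illustrative choices, not taken from the slides.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=1000, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)   # standard deviation of each conditional
    x, y = 5.0, -5.0                  # deliberately poor starting point
    samples = []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)    # draw x given the current y
        y = rng.gauss(rho * x, sd)    # draw y given the new x
        if t >= burn_in:              # discard the burn-in iterates
            samples.append((x, y))
    return samples


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
n = len(samples)
mean_x = sum(x for x, _ in samples) / n
mean_y = sum(y for _, y in samples) / n
# With zero means and unit variances, the mean of x*y estimates rho.
corr = sum(x * y for x, y in samples) / n
print(round(mean_x, 3), round(mean_y, 3), round(corr, 3))
```

Each update changes one coordinate at a time, which is why the early iterates move in axis-parallel steps before the chain settles into the target distribution.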

Data augmentation Gibbs sampling
- Introducing unobserved variables often simplifies the expression of the likelihood
- A Gibbs sampler can then be set up
- Samples from the Gibbs sampler can be used to estimate parameters

Pros and cons
- Gibbs sampling
  - Explore the space of configurations of a probabilistic model of the data according to the probability of each configuration
  - Based on incrementally perturbing the configuration one variable at a time, preferably choosing more likely configurations
- Pros
  - Clear probabilistic interpretation
  - Bayesian framework
  - “Global optimization”
- Cons
  - Mathematical details not easy to work out
  - Relatively slow

Gibbs biclustering

Gibbs sampling
- Current configuration
- Next gene configuration

- Updated gene configuration
- Next complete configuration
- Iterate many times
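The alternation between gene updates and condition updates can be sketched as a small collapsed Gibbs sampler. Everything specific here is an assumption made for illustration and may differ from the authors' model: a uniform background over three levels, a Dirichlet pseudocount `ALPHA` on each condition's pattern, a 30% prior membership probability, and the function names themselves.

```python
import math
import random

N_LEVELS = 3                                # low / medium / high
BACKGROUND = [1.0 / N_LEVELS] * N_LEVELS    # assumed uniform background
ALPHA = 1.0                                 # hypothetical Dirichlet pseudocount
PRIOR_LOG_ODDS = math.log(0.3 / 0.7)        # hypothetical 30% prior membership


def gene_log_odds(data, g, gene_in, cond_in):
    """Log-odds that gene g is in the bicluster, given all other labels."""
    others = [h for h in range(len(data)) if gene_in[h] and h != g]
    log_odds = PRIOR_LOG_ODDS
    for c, level in enumerate(data[g]):
        if cond_in[c]:
            count = sum(1 for h in others if data[h][c] == level)
            predictive = (count + ALPHA) / (len(others) + ALPHA * N_LEVELS)
            log_odds += math.log(predictive / BACKGROUND[level])
    return log_odds


def cond_log_odds(data, c, gene_in):
    """Log-odds that condition c is in the bicluster, given the gene labels."""
    counts = [0] * N_LEVELS
    for g, row in enumerate(data):
        if gene_in[g]:
            counts[row[c]] += 1
    n = sum(counts)
    # Dirichlet-multinomial marginal of the column under an unknown pattern.
    log_pattern = math.lgamma(ALPHA * N_LEVELS) - math.lgamma(ALPHA * N_LEVELS + n)
    log_pattern += sum(math.lgamma(ALPHA + k) - math.lgamma(ALPHA) for k in counts)
    log_background = sum(k * math.log(BACKGROUND[lvl]) for lvl, k in enumerate(counts))
    return PRIOR_LOG_ODDS + log_pattern - log_background


def bernoulli(rng, log_odds):
    """Draw True with probability sigmoid(log_odds)."""
    log_odds = max(-30.0, min(30.0, log_odds))   # avoid overflow in exp
    return rng.random() < 1.0 / (1.0 + math.exp(-log_odds))


def gibbs_bicluster(data, n_sweeps=200, burn_in=100, seed=0):
    """Return the post-burn-in membership frequency of each gene and condition."""
    rng = random.Random(seed)
    n_genes, n_conds = len(data), len(data[0])
    gene_in = [rng.random() < 0.5 for _ in range(n_genes)]
    cond_in = [rng.random() < 0.5 for _ in range(n_conds)]
    gene_freq, cond_freq = [0.0] * n_genes, [0.0] * n_conds
    kept = n_sweeps - burn_in
    for sweep in range(n_sweeps):
        for g in range(n_genes):        # next gene configuration
            gene_in[g] = bernoulli(rng, gene_log_odds(data, g, gene_in, cond_in))
        for c in range(n_conds):        # next complete configuration
            cond_in[c] = bernoulli(rng, cond_log_odds(data, c, gene_in))
        if sweep >= burn_in:            # average the iterated configurations
            for g in range(n_genes):
                gene_freq[g] += gene_in[g] / kept
            for c in range(n_conds):
                cond_freq[c] += cond_in[c] / kept
    return gene_freq, cond_freq


# Toy data: genes 0-2 agree on conditions 0 and 1; gene 3 does not.
toy = [[2, 2, 0], [2, 2, 1], [2, 2, 2], [0, 1, 2]]
gene_freq, cond_freq = gibbs_bicluster(toy, n_sweeps=50, burn_in=10)
print(gene_freq, cond_freq)
```

The returned frequencies are averages over sampled configurations, so a threshold (for example 0.5, also an assumption) would be needed to read off a single bicluster.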

Gibbs biclustering

Simulated data

Remarks
- Gibbs biclustering allows noisy patterns
- Optimized configuration is obtained by averaging successive iterated configurations
- Biclustering is oriented
  - Find a subset of samples for which a subset of genes is consistently expressed across those samples
  - Find a subset of genes that are consistently expressed across a subset of samples
- Searching for multiple patterns
  - For gene biclustering, remove the data of the genes from the current bicluster
  - Search for a new pattern
  - Stop if only the empty pattern is repeatedly found
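The search for multiple patterns described in the remarks can be sketched as a loop around any single-bicluster finder. The function `find_multiple_biclusters`, its parameters `max_biclusters` and `max_empty`, and the toy finder `rows_like_first` are hypothetical; a real run would pass a stochastic finder such as a Gibbs sampler, which is why an empty result is retried a few times before stopping.

```python
def find_multiple_biclusters(data, find_bicluster, max_biclusters=10, max_empty=3):
    """Repeatedly find a bicluster, then remove its genes and search again.

    find_bicluster(sub) must return (gene_indices, condition_indices),
    with gene indices relative to the rows of `sub`.
    """
    remaining = list(range(len(data)))   # original indices of genes still in play
    found, empty_runs = [], 0
    while remaining and len(found) < max_biclusters and empty_runs < max_empty:
        sub = [data[g] for g in remaining]
        genes, conds = find_bicluster(sub)
        if not genes or not conds:
            empty_runs += 1              # empty pattern: try again, then give up
            continue
        empty_runs = 0
        found.append(([remaining[i] for i in genes], list(conds)))
        chosen = set(genes)
        # Remove the data of the genes in the current bicluster.
        remaining = [g for i, g in enumerate(remaining) if i not in chosen]
    return found


def rows_like_first(sub):
    """Toy deterministic finder: genes identical to the first row, all conditions."""
    genes = [i for i, row in enumerate(sub) if row == sub[0]]
    return genes, list(range(len(sub[0])))


data = [[1, 1], [2, 2], [1, 1], [0, 0]]
found = find_multiple_biclusters(data, rows_like_first)
print(found)
```

Only genes are removed between rounds, which matches the gene-oriented case; the sample-oriented case would remove columns instead.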

Multiple biclusters

Leukemia fingerprints

Mixed-Lineage Leukemia
- Armstrong et al., Nature Genetics, 2002
- Mixed-Lineage Leukemia (MLL) is a subtype of ALL
  - Caused by chromosomal rearrangement in the MLL gene
  - Poorer prognosis than ALL
- Microarray analysis shows that MLL is distinct from ALL
- FLT3 tyrosine kinase distinguishes most strongly between MLL, ALL, and AML
  - Candidate drug target

PCA
[Figure: panels labeled PCA and Features]

Biclustering leukemia data
- Bicluster patients
  - Find patients for which a subset of genes has a consistent expression profile across this group of patients
- Discovery set
  - 21 ALL, 17 MLL, 25 AML
- Validation set
  - 3 ALL, 3 MLL, 3 AML

Discovering ALL
- Bicluster 1: 18 out of 21 ALL patients

Discovering MLL
- Bicluster 2: 14 out of 17 MLL patients

Discovering AML
- Bicluster 3: 19 out of 25 AML patients

Rescoring ALL

Rescoring MLL

Rescoring AML

K.U.Leuven ESAT-SCD-Bioi Qizheng Sheng