1 Gibbs biclustering of microarray data
Yves Moreau & Qizheng Sheng
Katholieke Universiteit Leuven, ESAT-SCD (SISTA)
On leave at Center for Biological Sequence analysis, Danish Technical University

2 Clustering (CBS Microarray Course, June 26, 2015)
- Form coherent groups of
  - Genes
  - Patient samples (e.g., tumors)
  - Drug or toxin responses
- Study these groups to get insight into biological processes
  - Diagnostic and prognostic classes
  - Genes in the same cluster can have the same function or the same regulation
- Clustering algorithms
  - Hierarchical clustering
  - K-means
  - Self-Organizing Maps
  - ...

3 What’s wrong with clustering?
- Clustering is a long-solved problem?!?
- Many problems with current clustering algorithms
  - PCA does not do any form of grouping
  - Hierarchical clustering does not produce distinct groups
    - Only a tree; it is then up to the user to pick nodes from the tree
  - K-means does not tell you how many clusters are really present in the data
  - ...

4 A wish list for clustering
- We expect a lot from a clustering algorithm
- Fast and not memory hungry
  - Can run easily on a large microarray data set
  - 10,000 to 100,000 genes, >100 experiments
- Partitioning of genes into distinct groups, with automatic determination of the “right” number of groups
- Robust
  - If you remove some genes and some experiments, you want to obtain roughly the same groups
- Rejection of outliers (genes that do not clearly belong to any group)
- Probabilistic cluster membership
  - One gene can belong to several clusters
- Incorporation of biological knowledge
  - Maybe you want some known genes to cluster together
- Meaning of the clusters?
- Heterogeneous microarray data sources

5 Biclustering microarray data

6 From genome projects to transcriptome projects
- Microarray cost per expression measurement
- Budgets and expertise
- Publicly available microarray data
- Need for exchange standards & repositories
- Big consortia set up big microarray projects
  - Genome projects → “transcriptome” projects (= compendia)
- Change in microarray projects (sequence analysis)
  - Analyze public data first to generate a hypothesis
  - Design and perform your own microarray experiment

7 Why biclustering?
- Data becomes more heterogeneous
- Gene clustering
  - Group genes that behave similarly over all conditions
- Gene biclustering
  - Group genes that behave similarly over a subset of conditions
  - “Feature selection”
  - More suitable for a heterogeneous compendium

8 Graphical models
Probabilistic graphical models
- Biostatistics: Bayesian stats, clustering, decision support
- Genetics: linkage analysis, phylogeny
- Sequence analysis: modeling protein families, gene prediction, regulatory sequence analysis
- Expression analysis: clustering, genetic network inference

9 Discretizing microarray data
- Microarray data is continuous
- Discretize by equal frequency
[Figure: distribution of expression values for a given gene, split into High, Medium, and Low; discretized microarray data set (genes × conditions) containing a bicluster]
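Equal-frequency discretization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: three levels (Low, Medium, High) as on the slide, discretization applied per gene, and made-up expression values. The function name `discretize_equal_frequency` is hypothetical, and ties are broken by sort order rather than handled specially.

```python
def discretize_equal_frequency(values, n_bins=3):
    """Map each value to a bin index 0..n_bins-1 so that bins hold roughly equal counts."""
    # Rank the values, then cut the ranks into n_bins equally sized groups.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(values)
    return labels

# One row per gene, one column per condition (made-up numbers).
expression = [
    [0.1, 2.3, 1.1, -0.4, 0.9, 3.0],
    [5.0, 4.2, 4.8, 4.9, 4.1, 4.5],
]
level_names = ["Low", "Medium", "High"]
discretized = [discretize_equal_frequency(row) for row in expression]
for row in discretized:
    print([level_names[b] for b in row])
```

Because every gene ends up with the same number of Low, Medium, and High entries, a uniform background distribution over the levels is a natural companion assumption.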

10 Bicluster

11 Likelihood
[Figure: background and pattern; scale from 0 to 1]

12 Likelihood
[Figure: discretized data matrix annotated with per-entry likelihoods of .9 and .05; scale from 0 to 1]

13 Likelihood: get the right genes
[Figure: discretized data matrix annotated with per-entry likelihoods of .9 and .05; scale from 0 to 1]

14 Likelihood: get the right conditions
[Figure: discretized data matrix annotated with per-entry likelihoods of .9 and .05; scale from 0 to 1]

15 Likelihood: get the right frequency pattern
[Figure: discretized data matrix annotated with per-entry likelihoods of .6 and .2; scale from 0 to 1]

16 Optimizing the bicluster
- Find the right bicluster
  - Genes
  - Conditions
  - Pattern
- For a given choice of genes and conditions, the “best” pattern is given by the frequencies found in the extracted pattern
  - No more need to optimize over the pattern
- Maximum likelihood: find genes and conditions that maximize
- Gibbs sampling: find genes and conditions that optimize
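The idea that the pattern follows from the chosen genes and conditions can be sketched as follows. This is a minimal illustration, not the authors' exact formulation: it assumes data discretized into levels 0, 1, 2, one frequency vector per bicluster condition, a fixed background distribution for all other entries, and a pseudocount of 1 to avoid zero frequencies. The names `pattern_frequencies` and `log_likelihood` are hypothetical.

```python
import math

def pattern_frequencies(data, genes, conditions, n_levels=3, pseudocount=1.0):
    """Per-condition level frequencies among the selected genes (the extracted pattern)."""
    pattern = {}
    for j in conditions:
        counts = [pseudocount] * n_levels
        for i in genes:
            counts[data[i][j]] += 1
        total = sum(counts)
        pattern[j] = [c / total for c in counts]
    return pattern

def log_likelihood(data, genes, conditions, background, n_levels=3):
    """Log-likelihood of the data: pattern inside the bicluster, background elsewhere."""
    pattern = pattern_frequencies(data, genes, conditions, n_levels)
    gene_set, cond_set = set(genes), set(conditions)
    total = 0.0
    for i, row in enumerate(data):
        for j, level in enumerate(row):
            if i in gene_set and j in cond_set:
                total += math.log(pattern[j][level])
            else:
                total += math.log(background[level])
    return total

# Toy data: genes 0-2 agree on conditions 0 and 1; everything else is arbitrary.
data = [
    [2, 0, 1, 0],
    [2, 0, 0, 2],
    [2, 0, 2, 1],
    [0, 1, 1, 2],
    [1, 2, 0, 0],
]
background = [1 / 3, 1 / 3, 1 / 3]
good = log_likelihood(data, [0, 1, 2], [0, 1], background)
poor = log_likelihood(data, [2, 3, 4], [2, 3], background)
print(good > poor)
```

Only the choice of genes and conditions is searched over; the pattern is recomputed from that choice each time the likelihood is evaluated.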

17 Gibbs sampling

18 Markov Chain Monte-Carlo
Markov chain with transition matrix T (each row sums to 1)

         A       C       G       T
    A  0.0643  0.8268  0.0659  0.0430
    C  0.0598  0.0484  0.8515  0.0403
    G  0.1602  0.3407  0.1736  0.3255
    T  0.1507  0.1608  0.3654  0.3231

[Figure: state diagram with states X=A, X=C, X=G, X=T]
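As a small illustration, the chain defined by T can be simulated directly. The sketch below uses the matrix values from the slide and reads each row as the distribution of the next state given the current one (an assumption consistent with the rows summing to 1). The function names, the start state, and the number of power-iteration steps are illustrative choices.

```python
import random

STATES = "ACGT"
# Row = current state, entries = probabilities of the next state in the order A, C, G, T.
T = {
    "A": [0.0643, 0.8268, 0.0659, 0.0430],
    "C": [0.0598, 0.0484, 0.8515, 0.0403],
    "G": [0.1602, 0.3407, 0.1736, 0.3255],
    "T": [0.1507, 0.1608, 0.3654, 0.3231],
}

def sample_chain(start, length, rng):
    """Generate a sequence by repeatedly sampling the next state from the current row of T."""
    seq = [start]
    while len(seq) < length:
        seq.append(rng.choices(STATES, weights=T[seq[-1]])[0])
    return "".join(seq)

def stationary(n_iter=200):
    """Approximate the stationary distribution by repeatedly applying T to a uniform start."""
    pi = [0.25] * 4
    for _ in range(n_iter):
        pi = [sum(pi[i] * T[STATES[i]][j] for i in range(4)) for j in range(4)]
    return pi

rng = random.Random(0)
seq = sample_chain("A", 20, rng)
pi = stationary()
print(seq)
print([round(p, 3) for p in pi])
```

Long runs of the chain visit each nucleotide with frequencies close to the stationary distribution, which is the property Markov Chain Monte-Carlo methods rely on.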

19 Markov Chain Monte-Carlo
Markov chains can sample from complex distributions

    ACGCGGTGTGCGTTTGACGA
    ACGGTTACGCGACGTTTGGT
    ACGTGCGGTGTACGTGTACG
    ACGGAGTTTGCGGGACGCGT
    ACGCGCGTGACGTACGCGTG
    AGACGCGTGCGCGCGGACGC
    ACGGGCGTGCGCGCGTCGCG
    AACGCGTTTGTGTTCGGTGC
    ACCGCGTTTGACGTCGGTTC
    ACGTGACGCGTAGTTCGACG
    ACGTGACACGGACGTACGCG
    ACCGTACTCGCGTTGACACG
    ATACGGCGCGGCGGGCGCGG
    ACGTACGCGTACACGCGGGA
    ACGCGCGTGTTTACGACGTG
    ACGTCGCACGCGTCGGTGTG
    ACGGCGGTCGGTACACGTCG
    ACGTTGCGACGTGCGTGCTG
    ACGGAACGACGACGCGACGC
    ACGGCGTGTTCGCGGTGCGG

[Figure: percentage of A, C, G, and T at each position]

20 Gibbs sampling
Markov chain for Gibbs sampling

21 Gibbs sampling
True target distribution (2D normal N(μ, Σ))

22 Gibbs sampling
First 20 Gibbs sampling iterates (conditionals are 1D normals)

23 Gibbs sampling
Burn-in samples (1000 samples)

24 Gibbs sampling
Samples after Markov chain convergence (samples 1000-2000)
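The 2D normal example can be reproduced with a short sketch. The slides do not give the mean or covariance, so the code assumes zero means, unit variances, and a correlation `rho` of 0.8; the start point is also an arbitrary choice. With these assumptions each conditional is a 1D normal: x given y has mean rho*y and variance 1 - rho^2, and symmetrically for y given x.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, rng, start=(-4.0, 4.0)):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    x, y = start
    sd = math.sqrt(1.0 - rho * rho)  # standard deviation of each 1D conditional
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, sd)  # draw x from p(x | y)
        y = rng.gauss(rho * x, sd)  # draw y from p(y | x)
        samples.append((x, y))
    return samples

rng = random.Random(0)
samples = gibbs_bivariate_normal(0.8, 2000, rng)
kept = samples[1000:]  # discard the first 1000 iterates as burn-in

n = len(kept)
mean_x = sum(x for x, _ in kept) / n
mean_y = sum(y for _, y in kept) / n
var_x = sum((x - mean_x) ** 2 for x, _ in kept) / n
var_y = sum((y - mean_y) ** 2 for _, y in kept) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in kept) / n
corr = cov / math.sqrt(var_x * var_y)
print(round(mean_x, 2), round(mean_y, 2), round(corr, 2))
```

The retained samples should have means near 0 and a sample correlation near the assumed 0.8, although successive Gibbs iterates are correlated, so the estimates are noisier than 1000 independent draws would give.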

25 Data augmentation Gibbs sampling
- Introducing unobserved variables often simplifies the expression of the likelihood
- A Gibbs sampler can then be set up
- Samples from the Gibbs sampler can be used to estimate parameters

26 Pros and cons
- Gibbs sampling
  - Explore the space of configurations of a probabilistic model of the data according to the probability of each configuration
  - Based on incrementally perturbing the configuration one variable at a time, preferably choosing more likely configurations
- Pros
  - Clear probabilistic interpretation
  - Bayesian framework
  - “Global optimization”
- Cons
  - Mathematical details not easy to work out
  - Relatively slow

27 Gibbs biclustering

28 Gibbs sampling
[Figure: current configuration → next gene configuration]

29 Updated gene configuration → next complete configuration → iterate many times
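The alternation between gene and condition updates can be sketched as a small collapsed Gibbs sampler. This is a simplified illustration and not necessarily the authors' algorithm. It assumes three discretized levels, a uniform background, a Dirichlet(alpha) prior on each condition's pattern, a prior membership probability of 0.5 for every gene and condition, and a noise-free planted bicluster in synthetic data. All function names are hypothetical.

```python
import math
import random

def sample_bernoulli(log_odds, rng):
    """Draw 1 with probability sigmoid(log_odds); clamp to avoid overflow."""
    log_odds = max(-50.0, min(50.0, log_odds))
    return 1 if rng.random() < 1.0 / (1.0 + math.exp(-log_odds)) else 0

def log_marginal(column, n_levels, alpha):
    """Log Dirichlet-multinomial marginal likelihood of one bicluster column."""
    result = math.lgamma(n_levels * alpha) - math.lgamma(n_levels * alpha + len(column))
    for level in range(n_levels):
        result += math.lgamma(alpha + column.count(level)) - math.lgamma(alpha)
    return result

def gibbs_bicluster(data, n_levels, n_iter, burn_in, rng, alpha=1.0):
    n_genes, n_conds = len(data), len(data[0])
    log_bg = math.log(1.0 / n_levels)  # assumed uniform background
    genes = [rng.randint(0, 1) for _ in range(n_genes)]
    conds = [rng.randint(0, 1) for _ in range(n_conds)]
    gene_sum, cond_sum = [0] * n_genes, [0] * n_conds
    for sweep in range(n_iter):
        # Gene step: resample each gene label given the conditions and the other genes.
        for i in range(n_genes):
            log_odds = 0.0  # prior membership probability 0.5 (assumption)
            for j in range(n_conds):
                if conds[j]:
                    others = [data[k][j] for k in range(n_genes) if genes[k] and k != i]
                    pred = (alpha + others.count(data[i][j])) / (n_levels * alpha + len(others))
                    log_odds += math.log(pred) - log_bg
            genes[i] = sample_bernoulli(log_odds, rng)
        # Condition step: resample each condition label given the current genes.
        for j in range(n_conds):
            column = [data[k][j] for k in range(n_genes) if genes[k]]
            log_odds = log_marginal(column, n_levels, alpha) - len(column) * log_bg
            conds[j] = sample_bernoulli(log_odds, rng)
        # After burn-in, accumulate labels so they can be averaged.
        if sweep >= burn_in:
            gene_sum = [s + g for s, g in zip(gene_sum, genes)]
            cond_sum = [s + c for s, c in zip(cond_sum, conds)]
    kept = n_iter - burn_in
    return [s / kept for s in gene_sum], [s / kept for s in cond_sum]

# Synthetic data: 30 genes x 10 conditions, bicluster planted in genes 0-11, conditions 0-5.
rng = random.Random(1)
data = [[rng.randrange(3) for _ in range(10)] for _ in range(30)]
pattern = [2, 0, 1, 2, 0, 1]
for i in range(12):
    for j in range(6):
        data[i][j] = pattern[j]
gene_prob, cond_prob = gibbs_bicluster(data, 3, 200, 100, rng)
print([round(p, 2) for p in gene_prob[:12]])
print([round(p, 2) for p in cond_prob])
```

The returned values are the fraction of post-burn-in sweeps in which each gene or condition was part of the bicluster, so they can be read as approximate membership probabilities.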

30 Gibbs biclustering

31 Simulated data

32 Remarks
- Gibbs biclustering allows noisy patterns
- Optimized configuration is obtained by averaging successive iterated configurations
- Biclustering is oriented
  - Find a subset of samples for which a subset of genes is consistently expressed across these samples
  - Find a subset of genes that are consistently expressed across a subset of samples
- Searching for multiple patterns
  - For gene biclustering, remove the data of the genes from the current bicluster
  - Search for a new pattern
  - Stop if only an empty pattern is repeatedly found
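Two of these remarks, averaging successive configurations and removing the genes of a found bicluster before the next search, can be sketched as below. The 0.5 threshold, the toy label samples, and the function names are assumptions made for illustration only.

```python
def average_configuration(label_samples, threshold=0.5):
    """Average 0/1 label vectors and keep items selected at least `threshold` of the time."""
    n = len(label_samples)
    freqs = [sum(column) / n for column in zip(*label_samples)]
    members = [i for i, f in enumerate(freqs) if f >= threshold]
    return freqs, members

def remove_genes(data, gene_names, members):
    """Drop the rows of genes already assigned to a bicluster before the next search."""
    drop = set(members)
    keep = [i for i in range(len(data)) if i not in drop]
    return [data[i] for i in keep], [gene_names[i] for i in keep]

# Four successive gene-label configurations from a hypothetical sampler run.
gene_samples = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
]
freqs, members = average_configuration(gene_samples)

data = [[2, 0], [2, 0], [1, 1], [0, 2]]
names = ["g1", "g2", "g3", "g4"]
remaining, remaining_names = remove_genes(data, names, members)
print(freqs, members, remaining_names)
```

A search for further patterns would rerun the sampler on `remaining` and stop once the averaged configuration repeatedly comes back empty.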

33 Multiple biclusters

34 Leukemia fingerprints

35 Mixed-Lineage Leukemia
- Armstrong et al., Nature Genetics, 2002
- Mixed-Lineage Leukemia (MLL) is a subtype of ALL
  - Caused by chromosomal rearrangement in the MLL gene
  - Poorer prognosis than ALL
- Microarray analysis shows that MLL is distinct from ALL
- FLT3 tyrosine kinase distinguishes most strongly between MLL, ALL, and AML
  - Candidate drug target

36 [Figure panels: PCA; Features]

37 Biclustering leukemia data
- Bicluster patients
  - Find patients for which a subset of genes has a consistent expression profile across this group of patients
- Discovery set
  - 21 ALL, 17 MLL, 25 AML
- Validation set
  - 3 ALL, 3 MLL, 3 AML

38 Discovering ALL
Bicluster 1: 18 out of 21 ALL patients

39 Discovering MLL
Bicluster 2: 14 out of 17 MLL patients

40 Discovering AML
Bicluster 3: 19 out of 25 AML patients

41 Rescoring ALL

42 Rescoring MLL

43 Rescoring AML

44 K.U.Leuven ESAT-SCD-Bioi Qizheng Sheng

