Clustering (Part II) 10/07/09. Outline Affinity propagation Quality evaluation.

Presentation transcript:

Clustering (Part II) 10/07/09

Outline Affinity propagation Quality evaluation

Affinity propagation: main idea Each data point is either an exemplar (cluster center) or a non-exemplar (a point assigned to an exemplar). Messages are passed between candidate exemplars and the other data points. The number of clusters is found automatically by the algorithm.

Responsibility r(j,k) Sent from data point j to candidate exemplar k: it tells k how well suited k is to serve as the exemplar for j, compared with the other candidate exemplars available to j.

Availability a(j,k) Sent from candidate exemplar k to data point j: it tells j how appropriate it would be for j to choose k as its exemplar.

Self-availability a(k,k) A candidate exemplar k evaluates whether it is itself a good exemplar.

An iterative procedure Update r(j, k) [Figure: data point j and candidate exemplar k, with messages r(j,k) and a(j,k’); s(j,k) is the similarity between j and k]

An iterative procedure Update a(j, k) [Figure: data point j and candidate exemplar k, with messages r(j’,k) and a(j,k)]

An iterative procedure Update a(k, k)
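
The update equations themselves did not survive in this transcript. As a reference sketch, the rules published in Frey and Dueck (2007) can be written with this deck's indices (data point j, candidate exemplar k). The symbol s(j,k) for the similarity, and s(k,k) for the "preference" of point k, are notation assumed from that paper rather than taken from the slides.

```latex
\begin{align}
r(j,k) &\leftarrow s(j,k) - \max_{k' \neq k} \bigl[ a(j,k') + s(j,k') \bigr] \\
a(j,k) &\leftarrow \min\Bigl\{ 0,\; r(k,k) + \sum_{j' \notin \{j,k\}} \max\bigl(0, r(j',k)\bigr) \Bigr\}, \quad j \neq k \\
a(k,k) &\leftarrow \sum_{j' \neq k} \max\bigl(0, r(j',k)\bigr)
\end{align}
```

After the messages settle, point j is assigned to the exemplar k that maximizes a(j,k) + r(j,k).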

Step-by-step affinity propagation
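
The iteration can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the update rules are those of Frey and Dueck (2007), and the function name, the damping factor of 0.5, the fixed iteration count, and the toy data are all hypothetical choices made for the example.

```python
def affinity_propagation(S, damping=0.5, iterations=200):
    """Cluster n >= 2 points from an n x n similarity matrix S (list of lists).

    S[k][k] is the "preference" of point k for being an exemplar.
    Returns, for each point j, the index of the exemplar it selects.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(j,k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(j,k)

    for _ in range(iterations):
        # r(j,k) = s(j,k) - max over k' != k of [a(j,k') + s(j,k')]
        for j in range(n):
            for k in range(n):
                best_other = max(A[j][kp] + S[j][kp]
                                 for kp in range(n) if kp != k)
                new_r = S[j][k] - best_other
                R[j][k] = damping * R[j][k] + (1 - damping) * new_r

        for k in range(n):
            positive = [max(0.0, R[jp][k]) for jp in range(n)]
            support = sum(positive) - positive[k]  # sum over j' != k
            for j in range(n):
                if j == k:
                    # self-availability a(k,k)
                    new_a = support
                else:
                    # a(j,k): exclude j's own contribution, cap at zero
                    new_a = min(0.0, R[k][k] + support - positive[j])
                A[j][k] = damping * A[j][k] + (1 - damping) * new_a

    # Each point picks the exemplar with the largest a(j,k) + r(j,k).
    return [max(range(n), key=lambda k: A[j][k] + R[j][k]) for j in range(n)]


# Toy example (assumed data): two well-separated groups on a line,
# similarity = negative squared distance, preference = -10 for every point.
points = [0, 1, 2, 10, 11, 12]
S = [[-float((a - b) ** 2) for b in points] for a in points]
for k in range(len(points)):
    S[k][k] = -10.0
labels = affinity_propagation(S)
```

Lower preferences on the diagonal make points less willing to be exemplars and so yield fewer clusters; the number of clusters is never passed in directly.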

Applications Multi-exon gene detection in mouse. Expression levels at different exons within a gene are co-regulated across tissue types. 37 mouse tissues involved. 12 tiling arrays. (Frey et al. 2005)

“Algorithms for unsupervised classification or cluster analysis abound. Unfortunately however, algorithm development seems to be a preferred activity to algorithm evaluation among methodologists. …… No consensus or clear guidelines exist to guide these decisions. Cluster analysis always produces clustering, but whether a pattern observed in the sample data characterizes a pattern present in the population remains an open question. Resampling-based methods can address this last point, but results indicate that most clusterings in microarray data sets are unlikely to reflect reproducible patterns or patterns in the overall population.” -Allison et al. (2006)

Stability of a cluster Motivation: Real clusters should be reproducible under perturbation: adding noise, omission of data, etc. Procedure: Perturb the observed data by adding noise. Apply the clustering procedure to the perturbed data. Repeat these steps to generate a sample of clusterings. Tests: a global test, and cluster-specific tests (R-index, D-index). (McShane et al. 2002)

Where is the “truth”? “In the context of unsupervised learning, there is no such direct measure of success. It is difficult to ascertain the validity of inference drawn from the output of most unsupervised learning algorithms. One must often resort to heuristic arguments not only for motivating the algorithm, but also for judgments as to the quality of results. This uncomfortable situation has led to heavy proliferation of proposed methods, since effectiveness is a matter of opinion and cannot be verified directly.” Hastie et al. 2001; ESL

Global test Null hypothesis: Data come from a multivariate Gaussian distribution. Procedure: Consider the subspace spanned by the top principal components. Estimate the distribution of “nearest neighbor” distances. Compare the observed data with data simulated under the null hypothesis.

R-index If cluster $i$ contains $n_i$ objects, then it contains $m_i = n_i (n_i - 1)/2$ pairs. Let $c_i$ be the number of those pairs that fall in the same cluster when the perturbed data are re-clustered. $r_i = c_i / m_i$ measures the robustness of cluster $i$. R-index $= \sum_i c_i / \sum_i m_i$ measures the overall stability of a clustering algorithm.
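
The R-index definition above can be turned into a short function. This is a minimal sketch for a single perturbed re-clustering; averaging over many perturbations, as the stability procedure suggests, is left to the caller. The function name and the label-list input format are assumptions made for the example.

```python
from itertools import combinations


def r_index(original, perturbed):
    """Compute per-cluster robustness r_i and the overall R-index.

    original, perturbed: lists of cluster labels, one per object, in the
    same object order. Labels need not match between the two clusterings.
    Returns (dict mapping original label -> r_i, overall R-index).
    """
    # Group object indices by their original cluster label.
    clusters = {}
    for obj, label in enumerate(original):
        clusters.setdefault(label, []).append(obj)

    r = {}
    total_c = 0
    total_m = 0
    for label, members in clusters.items():
        pairs = list(combinations(members, 2))
        m_i = len(pairs)  # n_i * (n_i - 1) / 2
        # c_i: pairs still together after re-clustering the perturbed data
        c_i = sum(1 for a, b in pairs if perturbed[a] == perturbed[b])
        r[label] = c_i / m_i if m_i else float("nan")  # singleton: undefined
        total_c += c_i
        total_m += m_i

    overall = total_c / total_m if total_m else float("nan")
    return r, overall


# Cluster 0 has three objects (3 pairs), of which one pair stays together;
# cluster 1 has two objects (1 pair), which stays together.
per_cluster, overall = r_index([0, 0, 0, 1, 1], [5, 5, 7, 9, 9])
```

Because only co-membership of pairs is compared, the cluster labels in the perturbed clustering can be arbitrary.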

D-index For each cluster, determine the closest cluster in the perturbed data. Calculate the average discrepancy between the clusters for the original and perturbed data: omission vs addition. The D-index is the sum of all cluster-specific discrepancies.

Applications 16 prostate cancer samples; 9 benign tumor samples; 6500 genes. Use hierarchical clustering to obtain 2, 3, and 4 clusters. Question: are these clusters reliable?

Issues with calculating R and D indices How large should the perturbation be? How should the significance level be quantified? What about nested consistency?

Biclustering

Motivation [Figure: gene expression matrix, genes by conditions] 1D approach: To identify condition clusters, all genes are used. But probably only a few genes are differentially expressed.

Motivation [Figure: gene expression matrix, genes by conditions] 1D approach: To identify gene clusters, all conditions are used. But a set of genes may only be expressed under a few conditions.

Motivation [Figure: gene expression matrix, genes by conditions] Bi-clustering objective: To isolate genes that are co-expressed under a specific set of conditions.

Coupled Two-Way Clustering An iterative procedure involving the following two steps. –Within a cluster of conditions, search for gene clusters. –Using features from a cluster of genes, search for condition clusters. (Getz et al. 2001)

SAMBA – A bipartite graph model V = Genes, U = Conditions (Tanay et al. 2002)

SAMBA – A bipartite graph model V = Genes, U = Conditions, E = “respond” = differential expression (Tanay et al. 2002)

SAMBA – A bipartite graph model V = Genes, U = Conditions, E = “respond” = differential expression. Cluster = subgraph (U’, V’, E’) = subset of co-regulated genes V’ in conditions U’ (Tanay et al. 2002)

SAMBA -- algorithm Goal: Find the “heaviest” subgraphs. H = (U’, V’, E’) Tanay et al. 2002

SAMBA -- algorithm Goal: Find the “heavy” subgraphs. H = (U’, V’, E’) [Figure: a subgraph H with a missing edge marked] Tanay et al. 2002

SAMBA -- algorithm $p_{u,v}$: probability of edge (u,v) expected at random. $p_c$: probability of an edge within a cluster. Compute a weight score for H = (U’, V’, E’). Tanay et al. 2002
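
The slide does not spell out the weight score. The sketch below assumes the log-likelihood-ratio weighting described in Tanay et al. 2002: each edge present in H contributes log(p_c / p_{u,v}) and each missing edge contributes log((1 - p_c) / (1 - p_{u,v})). The function name, the dictionary layout, and the value p_c = 0.9 are hypothetical choices for illustration.

```python
import math


def subgraph_weight(conditions, genes, edges, p_random, p_c=0.9):
    """Log-likelihood-ratio weight of the subgraph H = (U', V', E').

    conditions: iterable of condition ids (U')
    genes:      iterable of gene ids (V')
    edges:      set of (condition, gene) pairs where the gene responds
    p_random:   dict mapping (condition, gene) -> p_{u,v}, the probability
                of that edge under the random (background) model
    p_c:        assumed probability of an edge inside a true cluster
    """
    weight = 0.0
    for u in conditions:
        for v in genes:
            p_uv = p_random[(u, v)]
            if (u, v) in edges:
                weight += math.log(p_c / p_uv)  # positive when p_c > p_uv
            else:
                weight += math.log((1 - p_c) / (1 - p_uv))  # penalty
    return weight


# Toy example (assumed values): a 2 x 2 subgraph with one missing edge.
U = ["c1", "c2"]
V = ["g1", "g2"]
background = {(u, v): 0.2 for u in U for v in V}
complete = {(u, v) for u in U for v in V}
one_missing = complete - {("c2", "g2")}
w_complete = subgraph_weight(U, V, complete, background)
w_missing = subgraph_weight(U, V, one_missing, background)
```

Under this scoring a subgraph can tolerate a few missing edges and still be heavy, which is why the goal is stated in terms of heavy subgraphs rather than complete bicliques.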

SAMBA -- algorithm Finding the heaviest subgraph is an NP-hard problem. Use a polynomial-time algorithm to search for heavy subgraphs efficiently. H = (U’, V’, E’) Tanay et al. 2002

Significance of weight Let H = (U’, V’, E’) be a subgraph. Fix U’ and randomly select a new gene set V” with the same size as V’. The weights of the resulting subgraphs (U’, V”, E”) give a background distribution. Estimate the p-value by comparing log L(H) with the background distribution.
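
The resampling scheme can be sketched as an empirical p-value. This is an illustration under assumptions: `weight_fn` is a hypothetical callback that returns the weight of the subgraph induced by a condition set and a gene set, and the add-one correction and the default of 1000 samples are choices made here, not details from the slides.

```python
import random


def empirical_p_value(weight_fn, conditions, genes, all_genes,
                      n_samples=1000, seed=0):
    """Estimate how often a random gene set scores as high as V'.

    weight_fn:  callable (conditions, genes) -> subgraph weight (hypothetical)
    conditions: the fixed condition set U'
    genes:      the gene set V' of the candidate bicluster
    all_genes:  list of every gene that V'' may be drawn from
    """
    rng = random.Random(seed)
    observed = weight_fn(conditions, genes)
    at_least_as_heavy = 0
    for _ in range(n_samples):
        # Keep U' fixed; draw V'' with the same size as V'.
        random_genes = rng.sample(all_genes, len(genes))
        if weight_fn(conditions, random_genes) >= observed:
            at_least_as_heavy += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_heavy + 1) / (n_samples + 1)


# Toy example (assumed data): genes g0..g4 respond in both conditions,
# the other 45 genes respond in neither; weight = number of edges.
all_genes = ["g%d" % i for i in range(50)]
responding = {("c%d" % c, "g%d" % g) for c in range(2) for g in range(5)}


def count_edges(conds, gene_set):
    return sum((u, v) in responding for u in conds for v in gene_set)


p = empirical_p_value(count_edges, ["c0", "c1"], all_genes[:5], all_genes)
```

The resolution of the estimate is limited by `n_samples`: with 1000 draws the smallest reportable p-value is about 0.001.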

Model evaluation The p-value distribution for the top candidate clusters. If biological classification data are available, evaluate the purity of class membership within each bicluster.
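
One simple way to score class-membership purity is the fraction of annotated genes in a bicluster that share its most common class. The slides do not define purity precisely, so this definition, the function name, and the example annotations are assumptions for illustration.

```python
from collections import Counter


def bicluster_purity(genes, gene_class):
    """Return (dominant class, purity) for the genes of one bicluster.

    genes:      iterable of gene ids in the bicluster
    gene_class: dict mapping gene id -> known biological class;
                genes without an annotation are ignored
    """
    counts = Counter(gene_class[g] for g in genes if g in gene_class)
    if not counts:
        return None, float("nan")  # nothing annotated, purity undefined
    dominant, n_dominant = counts.most_common(1)[0]
    return dominant, n_dominant / sum(counts.values())


# Hypothetical annotations: three of four annotated genes share a class.
annotations = {"g1": "ribosome", "g2": "ribosome",
               "g3": "ribosome", "g4": "cell cycle"}
dominant, purity = bicluster_purity(["g1", "g2", "g3", "g4", "g5"], annotations)
```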

Reading List Frey and Dueck 2007: Affinity propagation. McShane et al. 2002: Clustering model evaluation. Tanay et al. 2002: SAMBA for biclustering.