
Thanh Le, Katheleen J. Gardiner University of Colorado Denver


1 A validation method for fuzzy clustering: a biological problem of gene expression data
Thanh Le, Katheleen J. Gardiner, University of Colorado Denver, July 18th, 2011

2 Overview
Introduction: data clustering approaches and current challenges
fzBLE: a novel method for validation of clustering results
Datasets: artificial and real datasets for testing fzBLE
Experimental results
Discussion: advantages and limitations of fzBLE

3 Clustering problem
Genes are clustered based on:
  Similarity
  Dissimilarity
Clusters are described by:
  Boundaries & overlaps
  Number of clusters
  Compactness within clusters
  Separation between clusters

4 Clustering approaches
Hierarchical approach
Partitioning approach
Hard clustering approach
  Crisp cluster boundaries
  Crisp cluster membership
Soft/Fuzzy clustering approach
  Overlapping cluster boundaries
  Soft/Fuzzy membership
  Appropriate for many real-world problems

5 Fuzzy C-Means algorithm
The model
Features:
  Fuzzy membership, soft cluster boundaries
  One gene can belong to multiple clusters and be assigned to multiple biological processes
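The model equation from this slide did not survive in the transcript. For orientation, the textbook fuzzy C-means objective is given below; the symbols (n data points x_i, c cluster centers v_k, memberships u_ki, fuzzifier m) are standard notation assumed here and may differ from the notation on the original slide.

```latex
J_m(U, V) = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^{\,m} \, \lVert x_i - v_k \rVert^{2},
\qquad \text{subject to} \quad \sum_{k=1}^{c} u_{ki} = 1, \quad u_{ki} \in [0, 1], \quad m > 1 .
```

Because each point's memberships sum to 1 across clusters rather than being 0 or 1, a gene can carry a non-trivial degree of membership in several clusters at once.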

6 Fuzzy C-Means (contd.)
Possibility-based model
Model parameters estimated using an iterative process
Rapid convergence
Most appropriate for gene expression data
Challenges:
  Determining the number of clusters
  Avoiding local optima
  Measuring goodness-of-fit to validate clustering results
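The iterative estimation mentioned on this slide can be sketched as follows. This is a minimal pure-Python version of the standard fuzzy C-means updates, not the authors' implementation; the function name `fuzzy_c_means` and its parameters (`m`, `max_iter`, `tol`, `seed`) are illustrative assumptions.

```python
import random


def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Minimal fuzzy C-means sketch (hypothetical helper, standard updates).

    points: list of equal-length lists of floats; c: number of clusters;
    m: fuzzifier (> 1). Returns (centers, u), where u[k][i] is the degree
    to which point i belongs to cluster k.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, normalised so each point's degrees sum to 1.
    u = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for i in range(n):
        total = sum(u[k][i] for k in range(c))
        for k in range(c):
            u[k][i] /= total

    centers = []
    for _ in range(max_iter):
        # Center update: membership-weighted mean of the data.
        centers = []
        for k in range(c):
            w = [u[k][i] ** m for i in range(n)]
            wsum = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / wsum
                 for d in range(dim)]
            )

        # Membership update from squared distances to the new centers.
        shift = 0.0
        for i in range(n):
            d2 = [
                max(sum((points[i][d] - centers[k][d]) ** 2
                        for d in range(dim)), 1e-12)
                for k in range(c)
            ]
            inv = [dk ** (-1.0 / (m - 1.0)) for dk in d2]
            total = sum(inv)
            for k in range(c):
                new = inv[k] / total
                shift = max(shift, abs(new - u[k][i]))
                u[k][i] = new

        # Stop once no membership changed by more than tol.
        if shift < tol:
            break
    return centers, u
```

The result depends on the random initial memberships, which is one way the local-optima challenge listed above shows up in practice; rerunning with several seeds is a common workaround.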

7 Methods for fuzzy clustering validation
Methods based on compactness and separation
  Problem: over-fitting; the larger the number of clusters, the better the index value.
  No rationale for how to scale the two factors in the model
Methods based on goodness of fit
  Statistical approach
  Expectation-Maximization (EM) method
    Converges slowly, particularly at cluster boundaries, because of the exponential function.
    Inappropriate for real datasets because the model assumes a data distribution: Gaussian, chi-squared…
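As a concrete reference for the index-based methods above, here is a small sketch of three widely used indices that also appear in the result tables later in the deck: the partition coefficient (PC), the partition entropy (PE) and the Xie-Beni index (XB). The formulas are the textbook definitions; the function names and the `u[k][i]` membership layout are assumptions made for this example.

```python
import math


def partition_coefficient(u):
    """PC: mean of squared memberships; 1.0 for a crisp partition."""
    n = len(u[0])
    return sum(x * x for row in u for x in row) / n


def partition_entropy(u):
    """PE: mean membership entropy; 0.0 for a crisp partition."""
    n = len(u[0])
    return -sum(x * math.log(x) for row in u for x in row if x > 0) / n


def xie_beni(points, centers, u, m=2.0):
    """XB: compactness divided by (n * minimum squared center separation)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n, c = len(points), len(centers)
    # Compactness: membership-weighted squared distances to the centers.
    compactness = sum(
        u[k][i] ** m * sqdist(points[i], centers[k])
        for k in range(c) for i in range(n)
    )
    # Separation: smallest squared distance between any two centers.
    separation = min(
        sqdist(centers[j], centers[k])
        for j in range(c) for k in range(j + 1, c)
    )
    return compactness / (n * separation)
```

XB divides a compactness term by a separation term, so the relative weight of the two factors is fixed by the formula rather than justified by the data, which is the scaling concern raised on this slide.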

8 The fzBLE method for cluster validation
Cluster using the Fuzzy C-Means clustering algorithm
Validate using goodness-of-fit (the log-likelihood estimator) and a Bayesian approach

9 Cluster validation: Goodness-of-fit & fuzzy clustering
Convert the possibility model into a probability model
Use a Bayesian approach to compute the statistics
Apply the Central Limit Theorem to effectively represent the data distribution
Model selection based on goodness-of-fit
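The transcript does not reproduce the fzBLE formulas, so the following is only a rough illustration of the general idea of scoring a fuzzy partition by a log-likelihood, not the authors' method. It assumes cluster priors proportional to the summed memberships, and a normal density per cluster with a membership-weighted, per-dimension variance; the function name and all parameters are hypothetical.

```python
import math


def weighted_gaussian_log_likelihood(points, centers, u, m=2.0, min_var=1e-6):
    """Log-likelihood of the data under a membership-weighted Gaussian mixture.

    Assumptions (not taken from the slides): priors are normalised sums of
    u[k][i] ** m, and each cluster has a diagonal covariance estimated from
    membership-weighted squared deviations around its center.
    """
    n, dim, c = len(points), len(points[0]), len(centers)
    weights = [sum(u[k][i] ** m for i in range(n)) for k in range(c)]
    grand_total = sum(weights)
    priors = [w / grand_total for w in weights]

    # Membership-weighted variance per cluster and dimension (floored).
    variances = [
        [
            max(
                sum(u[k][i] ** m * (points[i][d] - centers[k][d]) ** 2
                    for i in range(n)) / weights[k],
                min_var,
            )
            for d in range(dim)
        ]
        for k in range(c)
    ]

    log_likelihood = 0.0
    for x in points:
        density = 0.0
        for k in range(c):
            log_pdf = sum(
                -0.5 * math.log(2.0 * math.pi * variances[k][d])
                - (x[d] - centers[k][d]) ** 2 / (2.0 * variances[k][d])
                for d in range(dim)
            )
            density += priors[k] * math.exp(log_pdf)
        # Guard against log(0) for points far from every center.
        log_likelihood += math.log(max(density, 1e-300))
    return log_likelihood
```

Under these assumptions, model selection would amount to running the clustering for each candidate number of clusters and comparing the resulting scores; how fzBLE itself performs that comparison is not stated in the transcript.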

10 Datasets
Artificial datasets
  Finite mixture model based datasets
Real datasets
  Iris, Wine and Glass datasets at the UC Irvine Machine Learning Repository
  Gene datasets, which are more complex
    Yeast cell cycle gene expression (Yeast)
    Yeast gene functional annotations (Yeast-MIPS)
    Rat Central Nervous System (RCNS) gene expression

11 Experimental results on artificial datasets
Correctness ratios in determining the number of clusters
# clusters fzBLE PC PE FS XB CWB PBMF BR CF
3 1.00 0.42 0.83 0.00
4 0.92
5 0.75
6 0.58
7 0.67
8
9 0.33
PC: partition coefficient; PE: partition entropy; FS: Fukuyama-Sugeno; XB: Xie and Beni; CWB: Compose Within and Between scattering; PBMF: Pakhira, Bandyopadhyay and Maulik Fuzzy; BR: Rezaee B.; CF: Compactness factor
loop=5, #cluster range=[2,12]

12 Experimental results on Glass dataset
Algorithm Cluster Validity Scores and Decisions (highlighted in yellow)
# clusters fzBLE PC PE FS XB CWB PBMF BR CF
2 0.8884 0.1776 0.3700 0.7222 0.3732 1.9817 0.5782
3 0.8386 0.2747 0.1081 0.7817 0.4821 1.5004 0.4150
4 0.8625 0.2515 0.6917 0.4463 1.0455 0.3354
5 0.8577 0.2698 0.6450 0.4610 0.8380 0.2818
6 0.8004 0.3865 1.4944 0.3400 0.8371 0.2430
7 0.8183 0.3650 1.3802 0.3891 0.6914 0.2214
8 0.8190 0.3637 1.4904 0.6065 0.5916 0.2108
9 0.8119 0.3925 1.7503 0.3225 0.5634 0.1887
10 0.8161 0.3852 1.7821 0.3909 0.4926 0.1758
11 0.8259 0.3689 1.6260 0.3265 0.4470 0.1704
12 0.8325 0.3555 1.4213 0.5317 0.3949 0.1591
13 0.8317 0.3556 1.4918 0.6243 0.3544 0.1472

13 Experimental results on RCNS - more complex dataset; two-factor scaling issue
Algorithm Cluster Validity Scores and Decisions (highlighted in yellow)
# clusters fzBLE PC PE FS XB CWB PBMF BR CF
2 0.9942 0.0121 0.0594 5.5107 4.2087 1.1107
3 0.9430 0.0942 0.4877 4.1309 4.2839 1.6634
4 0.9142 0.1470 0.9245 6.1224 3.3723 1.3184
5 0.8900 0.1941 1.3006 9.4770 2.6071 1.1669
6 0.8695 0.2387 2.5231 1.9499 1.1026
7 0.8707 0.2386 2.1422 2.8692 0.7875
8 0.8925 0.2078 1.7245 2.5323 0.5894
9 0.8863 0.2192 1.6208 2.6041 0.5019
10 0.8847 0.2241 1.1897 3.4949 0.3918
112 genes during RCNS development at 9 time points
6 clusters, 4 of which are functionally annotated (Somogyi et al. 1995, Wen et al. 1998)

14 Discussion: The advantages of fzBLE
Performs better than other approaches on 3 levels of data.
Versus compactness-separation approaches:
  Solves the over-fit problem using goodness-of-fit.
  Eliminates the need for two scaling factors
Versus the mixture model with EM approach:
  Rapid convergence
  No assumption on data distribution
Scaling the two factors, compactness and separation, is similar to scaling gene expression values within each condition before clustering. The problems are:
  The number of genes on each chip is known, whereas the number of clusters is not.
  The values across multiple experimental conditions are consistent (fc, log of fc, …), whereas the values of the two factors are not.

15 Discussion: The limitations of fzBLE
Depends on internal validity
External validities are needed
  Biological validity: GO terms, pathways, PPI
Future work on gene expression:
  Distance definition based on biological context
  Combine fzBLE with biological homology and stability indices

16 Thank you! Questions?
We acknowledge the support from:
  National Institutes of Health
  Linda Crnic Institute
  Vietnamese Ministry of Education and Training

