Probability-based imputation method for fuzzy cluster analysis of gene expression microarray data Thanh Le, Tom Altman and Katheleen Gardiner University.


1 Probability-based imputation method for fuzzy cluster analysis of gene expression microarray data. Thanh Le, Tom Altman and Katheleen Gardiner, University of Colorado Denver. April 16, 2012.

2 Overview. Introduction: data clustering with missing values; current approaches. Proposed method (fzPBI): data clustering using Fuzzy C-Means; imputation using a probability model. Datasets: artificial and real datasets for testing fzPBI. Experimental results. Discussion.

3 Clustering with missing values. A complete data point is $x = (x_1, x_2, \dots, x_{n-1}, x_n)$; a data point with missing values is $x = (x_1, ?, \dots, ?, x_n)$. $X_M = \{?\}$ is the set of missing values and $X_P = X \setminus X_M$ is the set of present values. Problem: cluster analysis is based on dissimilarity, and distance is computed using every attribute of the data objects, so an improper distance measurement produces incorrect clustering results.
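To make the problem concrete, the following minimal sketch shows how a single missing attribute makes the ordinary Euclidean distance undefined, and how restricting the computation to observed attributes gives a number but silently ignores information. The two data points are made-up values, and the "observed attributes only" distance is just an illustration of the issue, not the method proposed in the slides.

```python
import numpy as np

# Two hypothetical 4-attribute data points; np.nan marks a missing value.
x = np.array([1.0, np.nan, 3.0, 4.0])
y = np.array([2.0, 2.0, 1.0, 4.0])

# Full Euclidean distance: one missing attribute makes the result NaN.
full_distance = np.sqrt(np.sum((x - y) ** 2))

# Distance restricted to the attributes observed in both points.
observed = ~np.isnan(x) & ~np.isnan(y)
observed_distance = np.sqrt(np.sum((x[observed] - y[observed]) ** 2))

print(full_distance, observed_distance)
```

The restricted distance drops the second attribute entirely, which is why the choice of how to treat missing values directly affects the clustering result.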

4 Current approaches. Data preprocessing to handle missing values: remove data points with missing values, or impute the missing values. During the clustering process: apply a clustering model in which missing values are estimated and used. Popular clustering methods: Expectation-Maximization (EM), model-based clustering; K-Means, crisp membership; Fuzzy C-Means (FCM), fuzzy membership with soft cluster boundaries, where each data point can belong to multiple clusters, which provides more relationship information.

5 Issues with current approaches. Heuristic methods: imputation using the nearest data points; purely heuristic, and the data distribution is not used. EM-based methods: model-based imputation of missing values; they depend on model assumptions and converge slowly, and missing values affect parameter estimation. FCM-based methods: distance-based imputation of missing values; fast convergence and arguably the best approach, but the data distribution is ignored.

6 Probability-based imputation - fzPBI
1. Data clustering using FCM
2. Possibility to probability transformation
3. Application of the central limit theorem to create the probability model of the data distribution
4. Application of the probability model to missing value imputation
5. Repeat steps 1-4 until convergence

7 Fuzzy C-Means algorithm. Objective function and model parameter estimation.
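The equations for this slide did not survive in the transcript. For reference, the standard FCM formulation (which may differ in notation from the authors' slide) minimizes $J_m = \sum_{i=1}^{n}\sum_{k=1}^{c} u_{ki}^{m} \lVert x_i - v_k \rVert^2$ subject to $\sum_{k} u_{ki} = 1$, alternating the updates $v_k = \sum_i u_{ki}^{m} x_i / \sum_i u_{ki}^{m}$ and $u_{ki} = 1 / \sum_{l=1}^{c} (d_{ki}/d_{li})^{2/(m-1)}$ with $d_{ki} = \lVert x_i - v_k \rVert$. A minimal NumPy sketch on complete data follows; the function name, default parameters, and demo data are illustrative assumptions, not taken from the presentation.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Standard FCM on complete data (assumes max_iter >= 1).

    Returns (U, V) where U[k, i] is the membership of point i in
    cluster k and V[k] is the center of cluster k.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial memberships, normalized so each column sums to 1.
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Center update: membership-weighted mean of the data points.
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Squared distances d2[k, i] between center k and point i.
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Membership update: u_ki proportional to d_ki^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V

# Demo on two well-separated synthetic blobs (made-up data).
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                    rng.normal(5.0, 0.1, (20, 2))])
U_demo, V_demo = fuzzy_c_means(X_demo, c=2)
```

Each column of `U_demo` sums to 1, which is the "fuzzy membership" property: a point is shared among clusters rather than assigned to exactly one.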

8 Distance measurement. p: the number of dimensions of the data space. Each missing value $x_{ij}$ is used with a confidence $w_j$ that is 0 at the beginning and 1 at the end.
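The slide's distance formula is not present in the transcript, so the following is only one plausible reading of "used with confidence": observed attributes contribute fully to the squared distance, while imputed attributes are scaled by a confidence weight between 0 and 1. The function name, the scalar weight `w`, and the example values are all assumptions made for illustration.

```python
import numpy as np

def weighted_sq_distance(x, v, imputed_mask, w):
    """Squared distance between point x and center v.

    Attributes flagged in imputed_mask (imputed missing values)
    contribute with weight w in [0, 1]; observed attributes
    contribute with weight 1. This weighting scheme is a
    hypothetical stand-in for the formula lost from the slide.
    """
    weights = np.where(imputed_mask, w, 1.0)
    return float(np.sum(weights * (x - v) ** 2))

x = np.array([1.0, 2.0, 3.0])            # second attribute was imputed
v = np.array([0.0, 0.0, 0.0])
imputed = np.array([False, True, False])

d_start = weighted_sq_distance(x, v, imputed, w=0.0)  # imputed value ignored
d_mid = weighted_sq_distance(x, v, imputed, w=0.5)
d_end = weighted_sq_distance(x, v, imputed, w=1.0)    # imputed value fully trusted
```

With `w = 0` the imputed attribute has no influence, and with `w = 1` the distance reduces to the ordinary squared Euclidean distance.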

9 Probability model. Application of the central limit theorem: a cluster is the mean of the different distribution models that describe the cluster's members, so it can be approximated using the normal distribution model. Possibility to probability transformation: $\{u_{ki}\}_{i=1..n}$ is the possibility distribution of X at $v_k$, and $\{p_{ki}\}_{i=1..n}$ is the probability distribution of X at $v_k$. The probability model at $v_k$ is created using $\{p_{ki}\}$, and missing values are then imputed using the probability model.
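The transcript does not say which possibility to probability transformation the authors use, nor how the normal model is parameterized. As a rough sketch under stated assumptions, the code below swaps in the simplest candidate: it normalizes the memberships $u_{ki}$ over the observed points so they sum to 1, treats the result as $p_{ki}$, and fits a normal model (mean and standard deviation) for one attribute at one cluster. The function name and the example numbers are hypothetical.

```python
import numpy as np

def normal_model_for_cluster(values, memberships):
    """Fit a normal model to one attribute at one cluster.

    values:      shape (n,), attribute values, np.nan where missing
    memberships: shape (n,), FCM memberships u_ki for a fixed cluster k
    Returns (mean, std) estimated from the observed values only.
    """
    observed = ~np.isnan(values)
    u = memberships[observed]
    # Stand-in transformation (an assumption): plain normalization.
    p = u / u.sum()
    mean = np.sum(p * values[observed])
    var = np.sum(p * (values[observed] - mean) ** 2)
    return float(mean), float(np.sqrt(var))

values = np.array([1.0, 2.0, np.nan, 3.0])
memberships = np.array([0.5, 0.5, 0.9, 1.0])
mean_k, std_k = normal_model_for_cluster(values, memberships)
```

One simple way to use such a model would be to impute the missing entry with `mean_k`, but the imputation rule actually used by fzPBI is not recoverable from the slide text.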

10 Datasets. Artificial datasets: a dataset generated using a finite mixture model, and a manually created non-uniform dataset in which clusters differ in size and the distances between clusters differ. Real datasets: the Iris and Wine datasets from the UC Irvine Machine Learning Repository, and the RCNS (rat central nervous system), Serum, Yeast and Yeast-MIPS gene expression datasets. Incomplete datasets were generated using different percentages of missing values.
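The slides do not describe how values were removed to build the incomplete datasets. A common approach, sketched here as an assumption, is to blank out a fixed fraction of entries chosen uniformly at random from a complete data matrix. The function name and parameters are illustrative.

```python
import numpy as np

def make_incomplete(X, missing_fraction, seed=0):
    """Return a copy of complete matrix X with a fraction of entries set to NaN.

    Entries are chosen uniformly at random without replacement.
    Also returns the boolean mask of removed entries. This sketch
    does not prevent a whole row from being removed.
    """
    rng = np.random.default_rng(seed)
    X_missing = np.array(X, dtype=float, copy=True)
    n_missing = int(round(missing_fraction * X_missing.size))
    flat_idx = rng.choice(X_missing.size, size=n_missing, replace=False)
    X_missing.flat[flat_idx] = np.nan
    mask = np.isnan(X_missing)
    return X_missing, mask

X_full = np.arange(20.0).reshape(5, 4)   # made-up complete data
X_miss, miss_mask = make_incomplete(X_full, missing_fraction=0.25)
```

Keeping the mask alongside the incomplete matrix makes it possible to compare imputed values against the known originals afterwards.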

11 Performance measures. Root mean square error (RMSE). Misclassification error (ME): compares the cluster label of each data object with its actual class label.
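The exact definitions on this slide are not in the transcript. The sketch below uses common definitions as assumptions: RMSE is computed between the true and imputed values at the missing positions, and ME is the fraction of objects whose cluster label disagrees with the class label under the best one-to-one mapping of clusters to classes. The brute-force search over mappings is only practical for a small number of clusters, and all names here are illustrative.

```python
from itertools import permutations

import numpy as np

def imputation_rmse(X_true, X_imputed, missing_mask):
    """RMSE between true and imputed values at the missing positions."""
    diff = X_true[missing_mask] - X_imputed[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def misclassification_error(class_labels, cluster_labels):
    """Fraction of mislabeled objects under the best cluster-to-class mapping.

    Assumes there are no more clusters than classes.
    """
    class_labels = np.asarray(class_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(class_labels)
    clusters = np.unique(cluster_labels)
    best = 1.0
    # Try every one-to-one assignment of cluster ids to class ids.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        mapped = np.array([mapping[c] for c in cluster_labels])
        best = min(best, float(np.mean(mapped != class_labels)))
    return best

X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_imp = np.array([[1.0, 2.5], [2.0, 4.0]])
mask = np.array([[False, True], [True, False]])
rmse = imputation_rmse(X_true, X_imp, mask)

me = misclassification_error([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 0])
```

The mapping step matters because cluster ids are arbitrary: cluster 1 may correspond to class 0 without that being an error.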

12 Uniform dataset. Methods compared: fzPBI (probability-based method), OCS (optimal completion strategy), NPS (nearest prototype strategy), FCMimp (FCM-based imputation), CIAO (alternating optimization), FCMGOimp (FCM- and GO-based imputation).

13 Non-uniform dataset

14 Iris dataset

15 RCNS gene expression dataset

16 Yeast gene expression dataset

17 Serum gene expression dataset

18 The advantages of fzPBI. It approximates the data distribution using a probability model. It applies that model to missing value imputation. It inherits the advantages of FCM and of model-based methods, together with the application of the central limit theorem.

19 Future work. Combine fzPBI with biological knowledge: protein-protein interactions and Gene Ontology. Internal measures use the data; external measures use the biological knowledge. Internal measures at missing values are adjusted using the external measures.

20 Thank you! Questions? We acknowledge the support of the Vietnamese Ministry of Education and Training through the 322 scholarship program.

