Probability-based imputation method for fuzzy cluster analysis of gene expression microarray data
Thanh Le, Tom Altman and Katheleen Gardiner
University of Colorado Denver
April 16, 2012

Overview
- Introduction
  - Data clustering with missing values
  - Current approaches
- Proposed method: fzPBI
  - Data clustering using Fuzzy C-Means
  - Imputation using probability model
- Datasets
  - Artificial and real datasets for testing fzPBI
- Experimental results
- Discussion

Clustering with missing values
- Data points and missing values
  - A complete data point: x = (x_1, x_2, …, x_{n-1}, x_n)
  - A data point with missing values: x = (x_1, ?, …, ?, x_n)
  - X_M = { ? }; X_P = X \ X_M
- Problem
  - Cluster analysis is based on dissimilarity.
  - Distance is computed using every attribute of the data objects.
  - Improper distance measurement produces incorrect clustering results.

Current approaches
- Data preprocessing to handle missing values
  - Remove data points with missing values
  - Impute missing values
- During the clustering process
  - A clustering model is applied
  - Missing values are estimated and used
- Popular clustering methods
  - Expectation-Maximization (EM): model-based clustering
  - K-Means: crisp membership
  - Fuzzy C-Means (FCM): fuzzy membership, soft cluster boundaries
    - Each data point can belong to multiple clusters, so more relationship information is provided

Current approaches’ issues
- Heuristic methods
  - Imputation using the nearest data points
  - Heuristic; the data distribution is not used
- EM-based methods
  - Model-based imputation of missing values
  - Rely on model assumptions; slow convergence
  - Missing values impact parameter estimation
- FCM-based methods
  - Distance-based imputation of missing values
  - Fast convergence; possibly the best approach
  - The data distribution is omitted

Probability-based imputation: fzPBI
1. Cluster the data using FCM.
2. Transform possibilities into probabilities.
3. Apply the central limit theorem to create a probability model of the data distribution.
4. Apply the probability model to impute missing values.
5. Repeat steps 1-4 until convergence.

Fuzzy C-Means algorithm
- Objective function
- Model parameter estimation
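The equations on this slide did not survive in the transcript. For orientation, the standard FCM formulation is reproduced here; it may differ in detail from the variant the authors showed. The symbols u_ki and v_k follow the slides, while c (number of clusters), m > 1 (fuzzifier) and d (distance) are notation assumed for this sketch.

```latex
% Standard FCM objective (assumed notation: c clusters, n data points, fuzzifier m > 1)
J_m(U, V) = \sum_{k=1}^{c} \sum_{i=1}^{n} u_{ki}^{m} \, d^{2}(x_i, v_k),
\qquad \text{subject to } \sum_{k=1}^{c} u_{ki} = 1 \text{ for every } i .

% Alternating parameter updates
u_{ki} = \left[ \sum_{l=1}^{c} \left( \frac{d(x_i, v_k)}{d(x_i, v_l)} \right)^{\frac{2}{m-1}} \right]^{-1},
\qquad
v_k = \frac{\sum_{i=1}^{n} u_{ki}^{m} \, x_i}{\sum_{i=1}^{n} u_{ki}^{m}} .
```

The two updates are applied in turn until the memberships or the objective stop changing appreciably.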

Distance measurement
- p: number of dimensions of the data space
- Each missing value x_ij is used with a confidence w_j, which is
  - 0 at the beginning
  - 1 at the end
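The distance formula itself is missing from the transcript. The following is a minimal sketch of how a confidence-weighted distance could behave, assuming a squared Euclidean distance in which observed attributes have weight 1 and imputed attributes have weight w. The function name `weighted_sq_distance` and the arguments `missing_mask` and `w` are hypothetical and not taken from the slides.

```python
import numpy as np

def weighted_sq_distance(x, v, missing_mask, w):
    """Squared Euclidean distance with imputed attributes down-weighted by w.

    ASSUMPTION: this weighting scheme is illustrative only; the slides do not
    give the exact formula used by fzPBI.
    """
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=float)
    missing_mask = np.asarray(missing_mask, dtype=bool)
    # Observed attributes count fully; imputed ones count with confidence w.
    weights = np.where(missing_mask, w, 1.0)
    return float(np.sum(weights * (x - v) ** 2))

x = [1.0, 5.0, 2.0]            # the second attribute holds an imputed guess
v = [0.0, 0.0, 0.0]            # a cluster centre
mask = [False, True, False]

d_start = weighted_sq_distance(x, v, mask, w=0.0)  # imputed value ignored
d_end = weighted_sq_distance(x, v, mask, w=1.0)    # imputed value fully trusted
```

With w = 0 the imputed attribute has no influence on the distance, and with w = 1 it is treated like an observed value, which matches the "0 at the beginning, 1 at the end" behaviour described on the slide.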

Probability model
- Application of the central limit theorem
  - A cluster is the mean of the different distribution models that describe the cluster’s members.
  - It can therefore be approximated using the normal distribution model.
- Possibility to probability transformation
  - {u_ki}, i = 1..n: possibility distribution of X at v_k
  - {p_ki}, i = 1..n: probability distribution of X at v_k
  - Create the probability model at v_k using {p_ki}
- Missing value imputation using the probability model
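The transformation and the model-fitting formulas are not preserved in the transcript. As a rough illustration only, the sketch below assumes the simplest possible transformation (normalising the memberships so that they sum to 1) and fits a normal model to one attribute by weighted mean and variance. Both choices, and the names `possibility_to_probability` and `normal_model`, are assumptions and may differ from what fzPBI actually does.

```python
import numpy as np

def possibility_to_probability(u_k):
    """ASSUMPTION: normalise the memberships {u_ki} so they sum to 1 over i."""
    u_k = np.asarray(u_k, dtype=float)
    return u_k / u_k.sum()

def normal_model(values, p_k):
    """Weighted mean and variance of one attribute under {p_ki}."""
    values = np.asarray(values, dtype=float)
    mean = np.sum(p_k * values)
    var = np.sum(p_k * (values - mean) ** 2)
    return float(mean), float(var)

# Memberships of four data points in cluster k, and their values on one attribute.
u_k = np.array([0.9, 0.8, 0.1, 0.2])
x_j = np.array([1.0, 1.2, 5.0, 4.8])

p_k = possibility_to_probability(u_k)
mean_k, var_k = normal_model(x_j, p_k)
```

Under these assumptions, the model mean for each cluster could serve as a candidate value when imputing a missing attribute of a point that belongs strongly to that cluster.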

Datasets
- Artificial datasets
  - A dataset generated using a finite mixture model
  - A non-uniform dataset, manually created
    - Clusters differ in size
    - Cluster distances are different
- Real datasets
  - Iris and Wine datasets from the UC Irvine Machine Learning Repository
  - RCNS (Rat central nervous system), Serum, Yeast and Yeast-MIPS gene expression datasets
- Incomplete datasets were generated using different percentages of missing values.
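The slides do not say how values were removed to create the incomplete datasets. One common way to do this, shown here as a sketch, is to delete a fixed fraction of entries completely at random; the helper name `make_incomplete` and its parameters are hypothetical.

```python
import numpy as np

def make_incomplete(X, missing_fraction, seed=0):
    """Return a copy of X with a given fraction of entries set to NaN.

    ASSUMPTION: entries are removed completely at random; the slides do not
    state the removal mechanism that was actually used.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_missing = int(round(missing_fraction * X.size))
    # Pick distinct flat positions, then convert them to (row, column) indices.
    flat_idx = rng.choice(X.size, size=n_missing, replace=False)
    X_missing = X.copy()
    X_missing[np.unravel_index(flat_idx, X.shape)] = np.nan
    return X_missing

X = np.arange(20.0).reshape(5, 4)
X_25 = make_incomplete(X, 0.25)   # 25% of the 20 entries become NaN
```

Keeping the original matrix alongside the masked copy makes it possible to compare imputed values with the true ones afterwards.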

Performance measures
- Root mean square error (RMSE)
- Misclassification error (ME)
  - Compares the cluster label of each data object with its actual class label
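The two measures can be sketched as follows. The formulas on the slide are not preserved, so this assumes the usual definitions: RMSE over the imputed entries, and ME as the fraction of mislabelled objects after choosing the cluster-to-class mapping that gives the fewest errors. The function names are hypothetical, and the exhaustive search over permutations is only practical for a small number of clusters.

```python
from itertools import permutations
import numpy as np

def rmse(true_values, imputed_values):
    """Root mean square error between true and imputed values."""
    t = np.asarray(true_values, dtype=float)
    p = np.asarray(imputed_values, dtype=float)
    return float(np.sqrt(np.mean((t - p) ** 2)))

def misclassification_error(class_labels, cluster_labels, n_clusters):
    """Fraction of objects whose cluster label disagrees with the class label.

    ASSUMPTION: labels are integers 0..n_clusters-1, and clusters are matched
    to classes by trying every one-to-one mapping and keeping the best.
    """
    y = np.asarray(class_labels)
    c = np.asarray(cluster_labels)
    best = 1.0
    for perm in permutations(range(n_clusters)):
        mapped = np.array([perm[k] for k in c])
        best = min(best, float(np.mean(mapped != y)))
    return best

err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
me = misclassification_error([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 0], 3)
```

In the example, cluster labels 0 and 1 are swapped relative to the classes, so the best mapping leaves a single mislabelled object out of six.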

Uniform dataset
- fzPBI: probability-based method
- OCS: optimal completion strategy
- NPS: nearest prototype strategy
- FCMimp: FCM-based imputation
- CIAO: Alternating Optimization
- FCMGOimp: FCM- and GO-based imputation

Non-uniform dataset

Iris dataset

RCNS gene expression dataset

Yeast gene expression dataset

Serum gene expression dataset

The advantages of fzPBI
- Approximates the data distribution using a probability model
- Applies the model to missing value imputation
- Inherits the advantages of FCM and model-based methods, together with the application of the central limit theorem

Future work
- Combine fzPBI with biological knowledge: protein-protein interaction, Gene Ontology
  - Internal measures using the data
  - External measures using the biological knowledge
  - Internal measures at missing values are adjusted using external measures.

Thank you! Questions?
We acknowledge the support from the Vietnamese Ministry of Education and Training, the 322 scholarship program.
