Probability-based imputation method for fuzzy cluster analysis of gene expression microarray data Thanh Le, Tom Altman and Katheleen Gardiner University.

Presentation transcript:

Probability-based imputation method for fuzzy cluster analysis of gene expression microarray data Thanh Le, Tom Altman and Katheleen Gardiner University of Colorado Denver April 16, 2012

Overview
- Introduction
  - Data clustering with missing values
  - Current approaches
- Proposed method: fzPBI
  - Data clustering using Fuzzy C-Means
  - Imputation using a probability model
- Datasets
  - Artificial and real datasets for testing fzPBI
- Experimental results
- Discussion

Clustering with missing values
- Data points and missing values
  - A complete data point: x = (x_1, x_2, …, x_{n-1}, x_n)
  - A data point with missing values: x = (x_1, ?, …, ?, x_n)
  - X_M = { ? } is the set of missing values; X_P = X \ X_M is the set of values that are present
- Problem
  - Cluster analysis is based on dissimilarity.
  - Distance is computed using every attribute of the data objects.
  - An improper distance measurement yields incorrect clustering results.
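To make the problem concrete, the sketch below shows a plain Euclidean distance becoming undefined as soon as one attribute is missing. It also shows a partial distance over the shared attributes, a common workaround that does not appear on the slides. The toy matrix `X` and the helper `partial_distance` are illustrative assumptions.

```python
import numpy as np

# Toy data: rows are data points, NaN marks a missing attribute value.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.5, np.nan, 2.5],
    [8.0, 9.0, np.nan],
])

missing = np.isnan(X)                  # mask of X_M (missing values)
complete_rows = ~missing.any(axis=1)   # points with no missing attribute

# A plain Euclidean distance is undefined once an attribute is missing.
naive = np.linalg.norm(X[0] - X[1])    # evaluates to nan


def partial_distance(a, b):
    """Euclidean distance over the attributes present in both points,
    rescaled by p / (number of shared attributes)."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.nan
    sq = np.sum((a[shared] - b[shared]) ** 2)
    return np.sqrt(sq * a.size / shared.sum())
```

The rescaling keeps distances comparable between pairs that share different numbers of attributes. It ignores the data distribution, which is the gap the talk sets out to close.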

Current approaches
- Data preprocessing to predict missing values
  - Remove data points with missing values
  - Impute the missing values
- During the clustering process
  - Apply a clustering model
  - Missing values are estimated and used
- Popular clustering methods
  - Expectation-Maximization (EM): model-based clustering
  - K-Means: crisp membership
  - Fuzzy C-Means (FCM): fuzzy membership and soft cluster boundaries; each data point can belong to multiple clusters, which provides more relationship information
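The difference between crisp and fuzzy membership can be illustrated with the standard FCM membership formula. This is a minimal sketch: the function name `fuzzy_memberships`, the fuzzifier default `m=2.0`, and the toy points are assumptions, not values from the talk.

```python
import numpy as np


def fuzzy_memberships(X, V, m=2.0):
    """Standard FCM memberships: U[k, i] is the degree to which
    point i belongs to cluster k; each column sums to 1."""
    # Squared distances between every center and every point, shape (c, n).
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, 1e-12)          # avoid division by zero
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)


X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])   # three points
V = np.array([[0.0, 0.0], [4.0, 0.0]])               # two cluster centers

U = fuzzy_memberships(X, V)     # fuzzy: graded membership in both clusters
crisp = U.argmax(axis=0)        # crisp (K-Means style): one cluster per point
```

The middle point receives membership 0.9 in the first cluster and 0.1 in the second, while the crisp assignment keeps only the label of the first cluster.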

Current approaches’ issues
- Heuristic methods
  - Imputation using the nearest data points
  - Heuristic; the data distribution is not used
- EM-based methods
  - Model-based imputation of missing values
  - Rely on model assumptions; slow convergence
  - Missing values affect parameter estimation
- FCM-based methods
  - Distance-based imputation of missing values
  - Fast convergence; possibly the best of the current approaches
  - The data distribution is ignored

Probability-based imputation: fzPBI
1. Cluster the data using FCM.
2. Transform possibility into probability.
3. Apply the central limit theorem to build a probability model of the data distribution.
4. Use the probability model to impute the missing values.
5. Repeat steps 1-4 until convergence.
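The five steps can be sketched as a loop. This is not the authors' implementation: the transcript does not preserve the actual formulas, so the column-mean start, the normalization used for step 2, and the expected-value imputation in step 4 are all assumptions, as are the names `fcm_step` and `impute_sketch`.

```python
import numpy as np


def fcm_step(X, V, m=2.0):
    """One standard FCM iteration: memberships U (c x n), then centers V (c x p)."""
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    inv = np.maximum(d2, 1e-12) ** (-1.0 / (m - 1.0))
    U = inv / inv.sum(axis=0, keepdims=True)
    W = U ** m
    V = (W @ X) / W.sum(axis=1, keepdims=True)
    return U, V


def impute_sketch(X_incomplete, c, n_iter=50, tol=1e-6, seed=0):
    """Iterative imputation loop following the five steps (assumed details)."""
    rng = np.random.default_rng(seed)
    missing = np.isnan(X_incomplete)
    X = X_incomplete.copy()
    # Assumption: start from the column means of the observed values.
    col_means = np.nanmean(X_incomplete, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    # Assumption: initialize centers with c randomly chosen data points.
    V = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(n_iter):
        # Step 1: cluster the current (imputed) data with FCM.
        U, V = fcm_step(X, V)
        # Step 2 (assumed): normalize each cluster's memberships over the
        # data points so that they sum to 1 and can act as probabilities.
        P = U / U.sum(axis=1, keepdims=True)
        # Step 3: per-cluster normal model (mean and variance per attribute).
        mu = P @ X
        var = np.maximum(P @ (X ** 2) - mu ** 2, 0.0)
        # Step 4 (assumed): impute with the membership-weighted mean.
        estimate = U.T @ mu
        change = np.abs(estimate[missing] - X[missing]).max() if missing.any() else 0.0
        X[missing] = estimate[missing]
        # Step 5: stop once the imputed values no longer move.
        if change < tol:
            break
    return X, U, V, (mu, var)
```

Observed values are never modified. Only the entries flagged in `missing` are updated, so each imputed value stays inside the observed range of its attribute.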

Fuzzy C-Means algorithm
- Objective function (equation not preserved in this transcript)
- Model parameter estimation (equations not preserved in this transcript)
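For reference, the textbook FCM formulation is given below. It is the standard form, not a reconstruction of the slide, and the authors' variant may differ. The notation u_ki and v_k follows the rest of the talk; the fuzzifier m > 1 and the number of clusters c are the usual symbols and are assumed here.

```latex
% Objective function, minimized subject to \sum_{k=1}^{c} u_{ki} = 1 for every i
J_m(U, V) = \sum_{k=1}^{c} \sum_{i=1}^{n} u_{ki}^{m} \, \lVert x_i - v_k \rVert^2

% Membership update
u_{ki} = \left[ \sum_{l=1}^{c}
    \left( \frac{\lVert x_i - v_k \rVert}{\lVert x_i - v_l \rVert} \right)^{\frac{2}{m-1}}
  \right]^{-1}

% Cluster center update
v_k = \frac{\sum_{i=1}^{n} u_{ki}^{m} \, x_i}{\sum_{i=1}^{n} u_{ki}^{m}}
```

The two updates are alternated until the memberships or the centers stop changing.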

Distance measurement
- p: the number of dimensions of the data space
- Each missing value x_ij is used with a confidence w_j, which is 0 at the beginning of the process and 1 at the end.
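One way to read the confidence idea is as a weight on imputed attributes inside the squared distance. The slide's formula is not preserved, so the linear schedule in `confidence` and the weighting in `weighted_sq_distance` are assumptions made for illustration only.

```python
import numpy as np


def confidence(t, n_iter):
    """Assumed linear schedule: 0 at the first iteration, 1 at the last."""
    return t / (n_iter - 1) if n_iter > 1 else 1.0


def weighted_sq_distance(x, v, imputed, w):
    """Squared distance in which observed attributes count fully and
    imputed attributes count with confidence w (an assumed form)."""
    weights = np.where(imputed, w, 1.0)
    return float(np.sum(weights * (x - v) ** 2))


x = np.array([1.0, 5.0, 2.0])                # second attribute was imputed
imputed = np.array([False, True, False])
v = np.array([0.0, 0.0, 0.0])                # a cluster center

d_start = weighted_sq_distance(x, v, imputed, confidence(0, 10))  # imputed value ignored
d_end = weighted_sq_distance(x, v, imputed, confidence(9, 10))    # imputed value fully trusted
```

Early in the process the imputed attribute contributes nothing to the distance. By the end it is treated like an observed value.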

Probability model
- Application of the central limit theorem
  - A cluster is the mean of the different distribution models that describe the cluster's members.
  - It can be approximated using the normal distribution model.
- Possibility-to-probability transformation
  - {u_ki}, i = 1..n: possibility distribution of X at v_k
  - {p_ki}, i = 1..n: probability distribution of X at v_k
- Create the probability model at v_k using {p_ki}
- Impute the missing values using the probability model
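The transcript does not preserve the transformation itself. One simple possibility-to-probability transformation is plain normalization, shown below together with the weighted normal model it leads to. Both are assumptions for illustration, not the authors' formulas; the symbols \mu_{kj} and \sigma_{kj}^2 (mean and variance of attribute j in cluster k) are introduced here.

```latex
% Assumed transformation: normalize the memberships of cluster k over the data points
p_{ki} = \frac{u_{ki}}{\sum_{i'=1}^{n} u_{ki'}}, \qquad \sum_{i=1}^{n} p_{ki} = 1

% Normal model at v_k, attribute by attribute
\mu_{kj} = \sum_{i=1}^{n} p_{ki} \, x_{ij}, \qquad
\sigma_{kj}^{2} = \sum_{i=1}^{n} p_{ki} \, (x_{ij} - \mu_{kj})^{2}

% Density used to model attribute j in cluster k
f_{kj}(x) = \frac{1}{\sqrt{2 \pi \sigma_{kj}^{2}}}
  \exp\!\left( - \frac{(x - \mu_{kj})^{2}}{2 \sigma_{kj}^{2}} \right)
```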

Datasets
- Artificial datasets
  - A dataset generated using a finite mixture model
  - A non-uniform dataset created manually
    - Clusters differ in size
    - Distances between clusters differ
- Real datasets
  - Iris and Wine datasets from the UC Irvine Machine Learning Repository
  - RCNS (rat central nervous system), Serum, Yeast and Yeast-MIPS gene expression datasets
- Incomplete datasets were generated using different percentages of missing values.
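Generating an incomplete dataset from a complete one can be sketched as follows. The slides do not state the missingness mechanism, so removing entries uniformly at random is an assumption, and `make_incomplete` is a hypothetical helper name.

```python
import numpy as np


def make_incomplete(X, fraction, seed=0):
    """Return a copy of X in which the given fraction of entries,
    chosen uniformly at random (an assumption), is replaced by NaN."""
    rng = np.random.default_rng(seed)
    X_inc = X.astype(float).copy()
    n_missing = int(round(fraction * X.size))
    idx = rng.choice(X.size, size=n_missing, replace=False)
    X_inc.flat[idx] = np.nan
    return X_inc


X_complete = np.arange(40.0).reshape(10, 4)
X_10pct = make_incomplete(X_complete, 0.10)   # 4 of the 40 entries removed
```

This simple version can remove every attribute of a point at high percentages. A real experiment would probably guard against that case.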

Performance measures
- Root mean square error (RMSE)
- Misclassification error (ME): compare the cluster label of each data object with its actual class label
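Both measures can be sketched in a few lines. The slide does not give the exact definitions, so two assumptions are made: RMSE is computed over the imputed entries only, and ME uses the mapping of cluster ids to class ids that gives the fewest errors. The function names are illustrative, and the labels are assumed to be integers in 0..c-1.

```python
from itertools import permutations

import numpy as np


def rmse(X_true, X_imputed, missing):
    """Root mean square error over the entries that were missing."""
    diff = X_true[missing] - X_imputed[missing]
    return float(np.sqrt(np.mean(diff ** 2)))


def misclassification_error(labels_true, labels_pred, c):
    """Smallest error rate over all one-to-one mappings of cluster ids
    to class ids (brute force, suitable only for small c)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    best = 1.0
    for perm in permutations(range(c)):
        mapped = np.array(perm)[labels_pred]   # relabel the clusters
        best = min(best, float(np.mean(mapped != labels_true)))
    return best
```

Cluster ids are arbitrary, so a direct comparison of labels would overstate the error; the search over mappings removes that effect. For larger c, an assignment solver would replace the brute-force loop.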

Uniform dataset
Methods compared:
- fzPBI: probability-based method
- OCS: optimal completion strategy
- NPS: nearest prototype strategy
- FCMimp: FCM-based imputation
- CIAO: alternating optimization
- FCMGOimp: FCM- and GO-based imputation

Non-uniform dataset

Iris dataset

RCNS gene expression dataset

Yeast gene expression dataset

Serum gene expression dataset

The advantages of fzPBI
- Approximates the data distribution using a probability model
- Applies the model to missing value imputation
- Inherits the advantages of FCM and model-based methods, and of applying the central limit theorem

Future work
- Combine fzPBI with biological knowledge: protein-protein interactions, Gene Ontology
  - Internal measures using the data
  - External measures using the biological knowledge
  - Internal measures at missing values are adjusted using external measures.

Thank you! Questions?
We acknowledge the support from the Vietnamese Ministry of Education and Training, the 322 scholarship program.