
Clustering Microarray Data Based on Density and Shared Nearest Neighbor Measure
CATA'06, March 23-25, 2006, Seattle, WA, USA
Ranapratap Syamala, Taufik Abidin, and William Perrizo
Dept. of Computer Science, North Dakota State University, Fargo, ND, USA

Microarray Experiments
One of the biggest breakthroughs in genomics for monitoring gene expression. Large amounts of data are being generated, which are useful in studying co-expressed genes.
Co-expressed genes:
- Genes that exhibit similar expression profiles
- Useful in identifying the functional categories of a group of genes and how genes interact to form interaction networks

Objectives of Microarray Data Analysis
- Class discovery: identify clusters of genes that have similar expression profiles over a time series of experiments. Clustering is the main technique employed in class discovery.
- Class prediction: assign an unspecified gene to a class, given the expression of other genes with known class labels. Classification is the main technique used in class prediction.
- Class comparison: identify the genes that differ in expression profile between different classes of genes.

Similarity Measure
In microarray data analysis, genes that exhibit similar expression profiles, or similar patterns of expression, are clustered together. Pearson's correlation coefficient is used to measure the similarity of two genes in this work: the higher the coefficient, the greater the similarity.
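As an illustration of the similarity measure (a minimal sketch, not the paper's implementation; the function name and sample profiles are invented), Pearson's correlation between two expression profiles can be computed as:

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two expression profiles."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two hypothetical genes measured at 5 time points; their profiles rise
# and fall together, so the coefficient is close to 1.0
g1 = [1.0, 2.0, 3.0, 2.5, 4.0]
g2 = [2.0, 4.1, 6.0, 5.2, 8.1]
print(pearson(g1, g2))
```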

Related Work
- Partition-based clustering: given a database of n objects and k, the number of clusters, the objects are organized into k disjoint partitions, each partition representing a cluster. Examples: K-means, K-medoids.
- Hierarchical clustering: agglomerative or divisive, based on the construction of a hierarchy. Examples: AGNES, DIANA.
- Density-based clustering: discovers clusters of arbitrary shape and effectively filters out noise, based on the notions of density and connectivity. Examples: DBSCAN, OPTICS.

Limitations
- Partition-based clustering: needs k, the number of clusters, a priori; almost always produces spherical clusters.
- Hierarchical clustering: highly dependent on the merge or split decisions of the branches, which cannot be undone, leading to low clustering quality.
- Density-based clustering: cannot find clusters if density varies significantly from region to region; scalability to large data sets is a problem.

The Proposed Clustering Algorithm
- Addresses the problems of requiring a priori knowledge of the number of clusters, and of finding clusters of arbitrary shape in high-dimensional data
- Based on the notion of density and a shared nearest neighbor measure
- Uses P-tree technology for efficient data mining (P-tree technology is patented by NDSU, United States Patent No. 6,941,303)

P-tree Overview
[Figure: a relation R(A1, A2, A3, A4) is vertically projected into attribute columns R[A1]..R[A4]; each bit position of each attribute is projected into a bit slice, and each slice is compressed into a basic P-tree P11..P43.]
Predicate tree (P-tree) technology:
- Vertically project each attribute
- Vertically project each bit position of each attribute, compressing each bit slice into a basic P-tree
- Basic logical operations (e.g., AND) can be performed on the P-trees
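The vertical projection step can be sketched with plain Python integers standing in for compressed P-trees (this is only an illustration of the bit-slicing idea, not the patented compressed-tree structure; the function names are invented):

```python
def bit_slices(column, nbits):
    """Vertically project one attribute column into nbits bit slices.
    Slice 0 holds the most significant bit of every value; row i of a
    slice is bit i of the packed integer mask."""
    slices = []
    for j in range(nbits - 1, -1, -1):
        mask = 0
        for i, v in enumerate(column):
            if (v >> j) & 1:
                mask |= 1 << i
        slices.append(mask)
    return slices

def root_count(mask):
    """Number of 1-bits in a slice (the root count of a P-tree)."""
    return bin(mask).count("1")

# A toy attribute column of 3-bit values: 101, 010, 111, 000
col = [5, 2, 7, 0]
s = bit_slices(col, 3)
# AND of the MSB slice and the LSB slice marks rows whose value has
# both bits set (rows 0 and 2, values 5 and 7)
print(root_count(s[0] & s[2]))   # prints 2
```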

Definitions
- Density: density(gi) = n, where n is the number of neighbors of gi whose similarity to gi is at least the similarity threshold. Used in the identification of core genes.
- Shared nearest neighbor measure: snn(gi, gj) = size(NN(gi) ∩ NN(gj)), where NN(gi) and NN(gj) are the nearest neighbor lists of genes gi and gj, respectively.
- Shared nearest neighbors in P-tree form: psnn(gi, gj) = rootCount(NNm(gi) AND NNm(gj)), where NNm(gi) and NNm(gj) are the nearest neighbor masks of genes gi and gj, respectively.
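The psnn computation can be sketched with Python integers standing in for the nearest neighbor masks (the helper names `nn_mask` and `snn` and the sample similarities are illustrative, not the paper's API):

```python
def nn_mask(sim_row, threshold):
    """Bitmask of a gene's neighbors: bit i is set when the similarity
    to gene i is at least the similarity threshold."""
    mask = 0
    for i, s in enumerate(sim_row):
        if s >= threshold:
            mask |= 1 << i
    return mask

def snn(mask_i, mask_j):
    """Shared nearest neighbor count: root count of the ANDed masks."""
    return bin(mask_i & mask_j).count("1")

# Similarities of two genes A and B to five genes, threshold 0.9
sims_a = [0.95, 0.40, 0.92, 0.97, 0.10]
sims_b = [0.91, 0.93, 0.20, 0.96, 0.50]
a = nn_mask(sims_a, 0.9)   # neighbors {0, 2, 3}
b = nn_mask(sims_b, 0.9)   # neighbors {0, 1, 3}
print(snn(a, b))           # shared neighbors {0, 3} -> prints 2
```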

Definitions (cont.)
- Core gene: the gene with the highest density and a neighbor count greater than zero
- Border gene: if the neighborhood of a gene gi contains at least one gene with higher density than gi, then gi is considered a border gene
- Noise: a gene with no neighbors is considered noise

Clustering Procedure
1. Identify the two genes with the highest density (core genes)
2. Find the nearest neighbors of both genes
3. Check whether the number of neighbors they share exceeds the snn threshold
   - If so, assign all the neighbors of both genes to the same cluster (bulk assignment)
   - If not, check each gene separately: if it is a core gene, process its neighbors and place them in a cluster; if not, it is a border gene
4. Genes with no neighbors are identified as noise
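The steps above can be compressed into a rough Python sketch. This is only my reading of the procedure, using integer bitmasks in place of P-tree neighbor masks; the function names, the loop structure, and the handling of leftover genes are all assumptions, not the published ClusHDS code:

```python
def popcount(m):
    return bin(m).count("1")

def cluster(nn_masks, snn_threshold):
    """Sketch of the density/SNN loop: nn_masks[i] is a bitmask of gene
    i's neighbors; density is simply the neighbor count."""
    n = len(nn_masks)
    density = [popcount(m) for m in nn_masks]
    labels = [None] * n
    next_id = 0
    # Genes with no neighbors are noise
    for i in range(n):
        if density[i] == 0:
            labels[i] = "noise"
    # Repeatedly take the two densest unassigned genes as core candidates
    order = sorted((g for g in range(n) if labels[g] is None),
                   key=lambda g: -density[g])
    while len(order) >= 2:
        gi, gj = order[0], order[1]
        shared = popcount(nn_masks[gi] & nn_masks[gj])
        if shared > snn_threshold:
            # Bulk assignment: both neighborhoods join one cluster
            members = nn_masks[gi] | nn_masks[gj] | (1 << gi) | (1 << gj)
        else:
            members = nn_masks[gi] | (1 << gi)
        for g in range(n):
            if (members >> g) & 1 and labels[g] is None:
                labels[g] = next_id
        next_id += 1
        order = [g for g in order if labels[g] is None]
    for g in order:             # at most one gene left over
        labels[g] = next_id
        next_id += 1
    return labels

# Genes 0-2 are mutual neighbors; genes 3-4 have none
print(cluster([0b110, 0b101, 0b011, 0b000, 0b000], 0))
# -> [0, 0, 0, 'noise', 'noise']
```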

ClusHDS: The Clustering Algorithm
Assigning the border points to clusters:
[Figure: the neighbor mask NNmB of a border gene is ANDed with each of the cluster masks C1..C6, and the root count of each AND operation is computed.]
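A plausible reading of this slide in bitmask form (that the border gene joins the cluster with the largest root count is my assumption, as is the `assign_border` name):

```python
def root_count(mask):
    return bin(mask).count("1")

def assign_border(nn_mask_b, cluster_masks):
    """AND the border gene's neighbor mask with each cluster mask and
    pick the cluster whose overlap has the largest root count.
    (The 'largest count wins' rule and tie-breaking are assumptions.)"""
    counts = [root_count(nn_mask_b & c) for c in cluster_masks]
    return counts.index(max(counts))

# Border gene's neighbors are genes {1, 2, 5}
nn_b = 0b100110
clusters = [0b000011,   # cluster 0 holds genes {0, 1} -> overlap size 1
            0b101100]   # cluster 1 holds genes {2, 3, 5} -> overlap size 2
print(assign_border(nn_b, clusters))   # prints 1
```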

Parameters
- Similarity threshold: highly co-expressed genes are desirable in microarray experiments. Hence, the higher the similarity threshold, the more compact the clusters of genes with similar function.
- Shared nearest neighbor threshold: determines the size of the clusters. If too large, clusters will have more genes; if too small, clusters with uniform density may be broken into several small, tight clusters. Hence, domain knowledge is required.

Results
Iyer's data set contains 517 genes with expression levels measured at 12 time points.
[Figure: gene expression profiles in the clusters obtained by ClusHDS from Iyer's data set with similarity ≥ 0.90 and snn = 20; running time 5.7 sec.]

Conclusion and Future Work
- Presented a new clustering algorithm based on density and shared nearest neighbors
- Automatically determines the number of clusters and identifies clusters of arbitrary shapes and sizes
- Improved performance due to P-trees: no database scans; fast AND and OR operations
Future work:
- Interactive sub-clustering based on different snnThreshold values
- Exploring the possibility of automatically determining the snnThreshold from the data set