Presentation transcript: Clustering Microarray Data based on Density and Shared Nearest Neighbor Measure (CATA'06, March 23-25, 2006, Seattle, WA, USA)

1 Clustering Microarray Data based on Density and Shared Nearest Neighbor Measure
CATA'06, March 23-25, 2006, Seattle, WA, USA
Ranapratap Syamala, Taufik Abidin, and William Perrizo
Dept. of Computer Science, North Dakota State University, Fargo, ND, USA

2 Microarray Experiments
One of the biggest breakthroughs in genomics for monitoring gene activity; large amounts of data are being generated.
Useful in studying co-expressed genes.
Co-expressed genes:
- genes that exhibit similar expression profiles
- useful in identifying the functional categories of a group of genes and how genes interact to form interaction networks

3 Objectives of Microarray Data Analysis
Class discovery: identify clusters of genes that have similar expression profiles over a time series of experiments. Clustering is the main technique employed in class discovery.
Class prediction: assign an unclassified gene to a class, given the expression profiles of other genes with known class labels. Classification is the main technique used in class prediction.
Class comparison: identify the genes whose expression profiles differ between different classes of genes.

4 In microarray data analysis, genes that exhibit similar expression profiles or similar patterns of expression are clustered together. In this work, Pearson's correlation coefficient is used to calculate the similarity of two genes: the higher the coefficient, the greater the similarity.
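For concreteness, a similarity function along these lines can be sketched in Python; the two gene profiles below are illustrative, not taken from the paper.

```python
import numpy as np

def pearson_similarity(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Two hypothetical genes measured at 12 time points (as in Iyer's data set).
g1 = [0.1, 0.4, 0.9, 1.3, 1.1, 0.8, 0.5, 0.3, 0.2, 0.2, 0.1, 0.0]
g2 = [0.0, 0.3, 0.8, 1.2, 1.0, 0.9, 0.6, 0.4, 0.2, 0.1, 0.1, 0.0]

print(pearson_similarity(g1, g2))  # close to 1.0, i.e. highly co-expressed
```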

5 Related work
Partition-based clustering: given a database of n objects and k, the desired number of clusters, the objects are organized into k disjoint partitions, each partition representing a cluster. Examples: k-means and k-medoids.
Hierarchical clustering: agglomerative or divisive, based on the construction of a hierarchy. Examples: AGNES, DIANA.
Density-based clustering: discovers clusters of arbitrary shape and effectively filters out noise, based on the notions of density and connectivity. Examples: DBSCAN, OPTICS.

6 Limitations
Partition-based clustering:
- needs k, the number of clusters, a priori
- almost always produces spherical clusters
Hierarchical clustering:
- depends heavily on the merge or split decisions along the branches, which cannot be undone once executed, potentially leading to low clustering quality
Density-based clustering:
- cannot find clusters if density varies significantly from region to region
- scalability to large data sets is a problem

7 The Proposed Clustering Algorithm
Addresses the problems of:
- a priori knowledge of the number of clusters
- clusters of arbitrary shape in high-dimensional data
Based on the notion of density and a shared nearest neighbor measure.
Uses P-tree technology [1] for efficient data mining.
[1] P-tree technology is patented by NDSU, United States Patent No. 6,941,303.

8 P-tree overview
Predicate tree technology:
- vertically project each attribute
- vertically project each bit position of each attribute
- compress each bit slice into a basic P-tree
Basic logical operations (AND, OR) can be performed on the P-trees.
[Figure: an example relation R(A1, A2, A3, A4) is vertically projected into bit slices R11 through R43, each of which is compressed into a basic P-tree P11 through P43; AND operations over the P-trees are illustrated.]
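As an illustration of the vertical bit-slicing idea only (the patented P-tree additionally compresses each slice into a tree, which this sketch omits), here is a minimal Python stand-in using flat bit vectors; the attribute values are a toy example in the spirit of the figure.

```python
import numpy as np

# Toy stand-in for vertical bit slicing. The real P-tree also compresses
# each slice into a tree; here each slice stays a flat bit vector.
values = np.array([2, 7, 6, 1, 6, 7, 6, 0], dtype=np.uint8)  # one 3-bit attribute

# Vertically project each bit position into its own bit slice.
slices = [(values >> bit) & 1 for bit in (2, 1, 0)]  # high, middle, low bit

# Basic logical operations on the slices, e.g. the mask of rows whose
# value has high bit 1 and low bit 0 (values 4 or 6):
mask = slices[0] & (1 - slices[2])
print(mask)             # [0 0 1 0 1 0 1 0]
print(int(mask.sum()))  # 3, the "root count" (number of matching rows)
```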

9 Definitions
Density: density(g_i) = n, where n is the number of neighbors of g_i, i.e. genes whose similarity to g_i is at or above the similarity threshold. Used in the identification of core genes.
Shared nearest neighbor measure: snn(g_i, g_j) = size(NN(g_i) ∩ NN(g_j)), where NN(g_i) and NN(g_j) are the nearest neighbor lists of genes g_i and g_j, respectively.
Shared nearest neighbors in P-tree form: psnn(g_i, g_j) = rootCount(NNm(g_i) AND NNm(g_j)), where NNm(g_i) and NNm(g_j) are the nearest neighbor masks of genes g_i and g_j, respectively.
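These definitions translate directly into mask operations. Below is a minimal sketch assuming a precomputed gene-by-gene Pearson similarity matrix S; plain NumPy boolean vectors stand in for the P-tree masks, with a popcount playing the role of rootCount.

```python
import numpy as np

def neighbor_mask(S, i, threshold):
    """NNm(g_i): mask of genes whose similarity to gene i meets the threshold."""
    mask = S[i] >= threshold
    mask[i] = False  # a gene is not counted as its own neighbor
    return mask

def density(S, i, threshold):
    """density(g_i) = n, the number of neighbors of g_i."""
    return int(neighbor_mask(S, i, threshold).sum())

def psnn(S, i, j, threshold):
    """psnn(g_i, g_j): root count of NNm(g_i) AND NNm(g_j)."""
    shared = neighbor_mask(S, i, threshold) & neighbor_mask(S, j, threshold)
    return int(shared.sum())
```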

10 Definitions
Core gene: a gene with the highest density and more than zero neighbors.
Border gene: if the neighbors of a gene g_i include at least one gene with higher density than g_i, then g_i is considered a border gene.
Noise: if a gene has zero neighbors, we consider that gene noise.

11 Clustering procedure
- Identify the two genes with the highest density (core genes)
- Find the nearest neighbors of both genes
- Check whether the number of neighbors they share exceeds the snn threshold
  - If it does, assign all the neighbors of both genes to the same cluster (bulk assignment)
- If not, check each gene separately:
  - If it is a core gene, process its neighbors and place them in a cluster
  - If it is not a core gene, then it is a border gene
- Genes with no neighbors are identified as noise
A rough sketch of this loop is given below.
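The following Python re-creation of the loop continues the helpers above. Tie-breaking, the exact core-gene test, and whether densities are recomputed over unassigned genes only are details the slide leaves open, so treat this strictly as an illustrative sketch, not the published ClusHDS implementation.

```python
import numpy as np

def clus_hds_sketch(S, sim_threshold, snn_threshold):
    """Illustrative re-creation of the clustering loop (details assumed)."""
    n = S.shape[0]
    NN = (S >= sim_threshold) & ~np.eye(n, dtype=bool)  # neighbor masks
    unassigned = np.ones(n, dtype=bool)
    clusters = []
    while True:
        dens = NN.sum(axis=1) * unassigned  # density, zeroed for assigned genes
        if dens.max() == 0:
            break  # remaining genes have no neighbors: noise
        i, j = np.argsort(dens)[-2:][::-1]  # two highest-density (core) genes
        shared = int((NN[i] & NN[j]).sum())  # psnn root count
        if shared > snn_threshold:
            members = NN[i] | NN[j]          # bulk assignment of both neighborhoods
            members[[i, j]] = True
        else:
            members = NN[i].copy()           # process the denser gene on its own
            members[i] = True
        members &= unassigned                # never reassign genes
        clusters.append(np.where(members)[0])
        unassigned &= ~members
    return clusters
```

The function returns one index array per cluster; genes never assigned by the loop are the noise genes.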

12 ClusHDS: The Clustering Algorithm
Assigning the border points to clusters: the border gene's nearest neighbor mask, NNmB, is ANDed with each of the cluster masks C1 through C6, and the root counts of the AND operations (2, 1, 5, 4, 2, 3 in the illustrated example) determine the assignment; the border gene goes to the cluster with the largest root count (here C3).
[Figure: the bit masks NNmB and C1 through C6, the AND results, and their root counts.]
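This border-assignment step reduces to an AND plus a popcount per cluster. A minimal sketch, using the figure's root counts as the example:

```python
import numpy as np

def assign_border(nn_mask_border, cluster_masks):
    """Assign a border gene to the cluster sharing the most neighbors with it."""
    counts = [int((nn_mask_border & cm).sum()) for cm in cluster_masks]
    return int(np.argmax(counts))  # index of the cluster with the largest root count

# With root counts [2, 1, 5, 4, 2, 3] as in the figure, the border gene
# is placed in the third cluster (C3).
```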

13 Parameters
Similarity threshold:
- Highly co-expressed genes are desirable in microarray experiments. Hence, the higher the similarity threshold, the more compact the clusters, with genes sharing similar function.
Shared nearest neighbor threshold:
- Determines the size of the clusters.
- If too large, clusters will contain more genes.
- If too small, clusters of uniform density may be broken into several small, tight clusters.
- Hence, domain knowledge is very much required.

14 Clusters obtained from Iyer's data set with similarity ≥ 0.90, snn threshold = 20, time = 5.7 sec.
[Figure: gene expression profiles in the clusters found by ClusHDS.]
Iyer's data set contains 517 genes, with expression levels measured at 12 time points.

15 Conclusion and future work
Presented a new clustering algorithm based on density and shared nearest neighbors.
Automatically determines the number of clusters and identifies clusters of arbitrary shapes and sizes.
Improved performance due to P-trees:
- no database scans
- fast AND and OR operations
Future work:
- interactive sub-clustering based on different snn thresholds
- explore the possibility of automatically determining the snn threshold from the data set

