Clustering Approaches Ka-Lok Ng Department of Bioinformatics Asia University.

Perform a cluster analysis on gene expression profiles

Perform a cluster analysis on gene expression profiles by computing the Pearson correlation coefficient
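As a minimal sketch of the similarity measure named above, the Pearson correlation coefficient between two expression profiles can be computed as follows. The gene names and expression values are hypothetical, invented only for illustration; the slides do not supply data at this point.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression profiles (made-up values, four conditions each).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises together with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # falls as gene_a rises

r_ab = pearson(gene_a, gene_b)  # close to +1: correlated profiles
r_ac = pearson(gene_a, gene_c)  # close to -1: anti-correlated profiles
```

Profiles with a coefficient near +1 are candidates for the same cluster. How the coefficient is turned into a distance (for example 1 - r) is a choice the slides do not specify.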

Hierarchical Clustering Method
We continue this process, clustering 1 with 4, then {2,3} with 5. The resulting hierarchy takes the form: [figure not reproduced]
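The merge order described above can be reproduced with a small agglomerative clustering sketch. Several things here are assumptions rather than part of the slides: the 1-D coordinates are hypothetical and were chosen only so that the merges come out in the stated order, single linkage (distance between the closest members) is just one possible linkage rule, and items 2 and 3 are assumed to have been merged first, as the notation {2,3} suggests.

```python
def single_linkage(points):
    """Return the merges made by single-linkage agglomerative clustering.

    points maps an item label to a 1-D coordinate (hypothetical data).
    """
    clusters = [frozenset([label]) for label in points]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((set(clusters[i]), set(clusters[j])))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical coordinates for items 1..5.
points = {1: 0.0, 2: 10.0, 3: 11.0, 4: 1.5, 5: 13.0}
merges = single_linkage(points)
for left, right in merges:
    print(sorted(left), "+", sorted(right))
```

With these coordinates the merges are 2 with 3, then 1 with 4, then {2,3} with 5, and finally the two remaining groups.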

K-Means Clustering Problem: Formulation
Input: A set, V, consisting of n points and a parameter k
Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X
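The transcript does not spell out d(V,X). The sketch below assumes the usual definition: the mean, over all points in V, of the squared Euclidean distance from each point to its nearest center in X. The function name is hypothetical; the four points are the medicine coordinates used later in these slides.

```python
import math

def squared_error_distortion(points, centers):
    """Mean squared distance from each point to its nearest center (assumed definition)."""
    total = 0.0
    for p in points:
        nearest = min(math.dist(p, c) for c in centers)
        total += nearest ** 2
    return total / len(points)

V = [(1, 1), (2, 1), (4, 3), (5, 4)]

# Two candidate sets of k = 2 centers; the lower distortion is the better choice.
d_first = squared_error_distortion(V, [(1, 1), (2, 1)])
d_second = squared_error_distortion(V, [(1.5, 1), (4.5, 3.5)])
print(d_first, d_second)
```

The K-Means problem asks for the set X with the smallest such value among all possible sets of k centers.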

1-Means Clustering Problem: an Easy Case
Input: A set, V, consisting of n points
Output: A single point x (cluster center) that minimizes the squared error distortion d(V,x) over all possible choices of x
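The case k = 1 is easy because the point that minimizes the sum of squared Euclidean distances is the centroid (coordinate-wise mean) of V. A small numerical check, using hypothetical points and assuming d(V,x) is the mean squared Euclidean distance:

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def distortion(points, x):
    """Mean squared Euclidean distance from the points to a single center x."""
    return sum(sum((pi - xi) ** 2 for pi, xi in zip(p, x))
               for p in points) / len(points)

# Hypothetical data: the four corners of a square.
V = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]

best = centroid(V)                    # (1.0, 1.0)
shifted = (best[0] + 0.5, best[1])    # any other candidate center

print(distortion(V, best), distortion(V, shifted))
```

Moving the center away from the centroid in any direction increases the distortion.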

K-Means Clustering Problem: Formulation
The basic step of k-means clustering is simple. Iterate until stable (that is, until no object changes group):
1. Determine the centroid coordinates
2. Determine the distance of each object to the centroids
3. Group the objects based on minimum distance
Ref: di/tutorial/kMean/NumericalExample.htm

K-Means Clustering Problem: Formulation
Suppose we have several objects (4 types of medicine) and each object has two attributes or features, as shown in the table below. Our goal is to group these objects into K = 2 groups of medicine based on the two features (weight index and pH).

Object        Attribute 1 (X): weight index    Attribute 2 (Y): pH
Medicine A    1                                1
Medicine B    2                                1
Medicine C    4                                3
Medicine D    5                                4

Each medicine represents one point with two attributes (X, Y), which we can represent as a coordinate in an attribute space, as shown in the figure on the right.

K-Means Clustering Problem: Formulation
1. Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids. Let C1 and C2 denote the coordinates of the centroids; then C1 = (1, 1) and C2 = (2, 1).
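Starting from these initial centroids, one pass through the remaining steps (distances, grouping, then updated centroids) can be sketched as below. The variable names are hypothetical, and ties in distance, which do not occur here, would go to the first centroid.

```python
import math

# Medicine coordinates (weight index, pH) and the initial centroids C1, C2.
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [(1, 1), (2, 1)]

# Distance of each object to each centroid.
distances = {name: [math.dist(p, c) for c in centroids]
             for name, p in medicines.items()}

# Group each object with its nearest centroid (0 = group 1, 1 = group 2).
groups = {name: d.index(min(d)) for name, d in distances.items()}

# Recompute each centroid as the mean of its group members.
new_centroids = []
for g in range(len(centroids)):
    members = [medicines[name] for name, gi in groups.items() if gi == g]
    new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))

print(groups)         # A in group 1; B, C and D in group 2
print(new_centroids)  # C1 stays at (1, 1); C2 moves to (11/3, 8/3)
```

After this first pass only medicine A remains with C1, so C2 moves toward the other three points; repeating the steps continues until no object changes group.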

1-Means Clustering Problem: an Easy Case
The 1-Means clustering problem is easy. However, the problem becomes very difficult (NP-hard) for more than one center. An efficient heuristic (a method of learning by discovery) for K-Means clustering is the Lloyd algorithm.
Repeat the following two steps until the algorithm converges or until the fluctuations become very small:
1. Assign each data point to the cluster C_i corresponding to the closest cluster representative x_i (1 ≤ i ≤ k).
2. After all n data points have been assigned, compute new cluster representatives according to the center of gravity of each cluster; that is, the new cluster representative is $x = \frac{1}{|C|} \sum_{v \in C} v$ for every cluster $C$.
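The two steps of the Lloyd algorithm can be sketched as follows. This is a minimal illustration and not a reference implementation: the function name, the `tol` and `max_iter` parameters, and the rule of leaving an empty cluster's representative unchanged are all assumptions. The data are the four medicine points and initial centroids from the earlier slides.

```python
import math

def lloyd(points, centers, tol=1e-9, max_iter=100):
    """Lloyd heuristic for k-means; stops when the centers move by at most tol."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # Update step: each center moves to the center of gravity of its cluster.
        # An empty cluster keeps its old center (an assumption of this sketch).
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centers)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, clusters

points = [(1, 1), (2, 1), (4, 3), (5, 4)]
centers, clusters = lloyd(points, [(1, 1), (2, 1)])
print(centers)   # [(1.5, 1.0), (4.5, 3.5)]
print(clusters)  # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
```

For this data the centers stop moving on the third pass. Because the method is a heuristic, the result is a local optimum that depends on the initial centers.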

K-Means Clustering Problem: Formulation
Like other algorithms, K-means clustering has several weaknesses:
When the data set is small, the initial grouping strongly determines the resulting clusters.
The number of clusters, k, must be determined beforehand.
The real clusters are never known; with a small data set, the same data entered in a different order may produce different clusters.
The result is sensitive to the initial condition: different initial conditions may produce different clusters, and the algorithm may be trapped in a local optimum.
We never know which attribute contributes more to the grouping process, since each attribute is assumed to have the same weight.
The arithmetic mean is not robust to outliers: data points very far from the centroid may pull the centroid away from its real position.
The resulting clusters are circular in shape, because the grouping is based on distance.
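The outlier weakness can be seen in a few lines. The 1-D values below are hypothetical, and the median is shown only for contrast; it is not part of the K-means method described in these slides.

```python
from statistics import mean, median

cluster = [1.0, 2.0, 3.0]          # hypothetical 1-D cluster
with_outlier = cluster + [100.0]   # the same cluster plus one far-away point

mean_before = mean(cluster)            # 2.0
mean_after = mean(with_outlier)        # 26.5: pulled far from the original points
median_after = median(with_outlier)    # 2.5: barely moves

print(mean_before, mean_after, median_after)
```

A single distant point moves the centroid well outside the range of the original cluster.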