K-Means and DBSCAN Erik Zeitler Uppsala Database Laboratory.

Slides:

Advertisements

Similar presentations

Advertisements

Cluster Analysis: Basic Concepts and Algorithms

Hierarchical Clustering, DBSCAN The EM Algorithm

Clustering Basic Concepts and Algorithms

PARTITIONAL CLUSTERING

Presented by: GROUP 7 Gayathri Gandhamuneni & Yumeng Wang.

DBSCAN – Density-Based Spatial Clustering of Applications with Noise M.Ester, H.P.Kriegel, J.Sander and Xu. A density-based algorithm for discovering clusters.

Qiang Yang Adapted from Tan et al. and Han et al.

Clustering Prof. Navneet Goyal BITS, Pilani

CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Clustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu.

Part II - Clustering© Prentice Hall1 Clustering Large DB Most clustering algorithms assume a large data structure which is memory resident. Most clustering.

Clustering Methods Professor: Dr. Mansouri

More on Clustering Hierarchical Clustering to be discussed in Clustering Part2 DBSCAN will be used in programming project.

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.

Cluster Analysis.

An Introduction to Clustering

© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.

SCAN: A Structural Clustering Algorithm for Networks

Cluster Analysis.

Cluster Analysis: Basic Concepts and Algorithms

What is Cluster Analysis?

What is Cluster Analysis?

© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.

K-means Clustering. What is clustering? Why would we want to cluster? How would you determine clusters? How can you do this efficiently?

DATA MINING LECTURE 8 Clustering The k-means algorithm

Health and CS Philip Chan. DNA, Genes, Proteins What is the relationship among DNA Genes Proteins ?

Clustering Unsupervised learning Generating “classes”

CSC 4510 – Machine Learning Dr. Mary-Angela Papalaskari Department of Computing Sciences Villanova University Course website:

Unsupervised Learning. CS583, Bing Liu, UIC 2 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate.

COMMON EVALUATION FINAL PROJECT Vira Oleksyuk ECE 8110: Introduction to machine Learning and Pattern Recognition.

An Efficient Approach to Clustering in Large Multimedia Databases with Noise Alexander Hinneburg and Daniel A. Keim.

START OF DAY 8 Reading: Chap. 14. Midterm Go over questions General issues only Specific issues: visit with me Regrading may make your grade go up OR.

Clustering Methods K- means. K-means Algorithm Assume that K=3 and initially the points are assigned to clusters as follows. C 1 ={x 1,x 2,x 3 }, C 2.

Data Clustering 2 – K Means contd & Hierarchical Methods Data Clustering – An IntroductionSlide 1.

Topic9: Density-based Clustering

Unsupervised Learning. Supervised learning vs. unsupervised learning.

DBSCAN Data Mining algorithm Dr Veljko Milutinović Milan Micić

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology A modified version of the K-means algorithm with a distance.

Clustering Clustering is a technique for finding similarity groups in data, called clusters. I.e., it groups data instances that are similar to (near)

CS 8751 ML & KDDData Clustering1 Clustering Unsupervised learning Generating “classes” Distance/similarity measures Agglomerative methods Divisive methods.

Presented by Ho Wai Shing

Density-Based Clustering Methods. Clustering based on density (local cluster criterion), such as density-connected points Major features: –Discover clusters.

Ch. Eick: Introduction to Hierarchical Clustering and DBSCAN 1 Remaining Lectures in Advanced Clustering and Outlier Detection 2.Advanced Classification.

CSSE463: Image Recognition Day 23 Midterm behind us… Midterm behind us… Foundations of Image Recognition completed! Foundations of Image Recognition completed!

Compiled By: Raj Gaurang Tiwari Assistant Professor SRMGPC, Lucknow Unsupervised Learning.

Clustering/Cluster Analysis. What is Cluster Analysis? l Finding groups of objects such that the objects in a group will be similar (or related) to one.

Other Clustering Techniques

CLUSTERING DENSITY-BASED METHODS Elsayed Hemayed Data Mining Course.

CZ5211 Topics in Computational Biology Lecture 4: Clustering Analysis for Microarray Data II Prof. Chen Yu Zong Tel:

Marko Živković 3179/2015.  Clustering is the process of grouping large data sets according to their similarity  Density-based clustering: ◦ groups together.

Clustering Wei Wang. Outline What is clustering Partitioning methods Hierarchical methods Density-based methods Grid-based methods Model-based clustering.

Clustering Approaches Ka-Lok Ng Department of Bioinformatics Asia University.

1 Similarity and Dissimilarity Between Objects Distances are normally used to measure the similarity or dissimilarity between two data objects Some popular.

Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.

Christoph F. Eick Questions Review October 12, How does post decision tree post-pruning work? What is the purpose of applying post-pruning in decision.

Clustering (2) Center-based algorithms Fuzzy k-means Density-based algorithms ( DBSCAN as an example ) Evaluation of clustering results Figures and equations.

Clustering Anna Reithmeir Data Mining Proseminar 2017

Data Mining: Basic Cluster Analysis

More on Clustering in COSC 4335

Hierarchical Clustering: Time and Space requirements

CS 685: Special Topics in Data Mining Jinze Liu

CSSE463: Image Recognition Day 23

CS 685: Special Topics in Data Mining Jinze Liu

CS 485G: Special Topics in Data Mining

CSE572, CBS572: Data Mining by H. Liu

Gyozo Gidofalvi Uppsala Database Laboratory

CSE572: Data Mining by H. Liu

CS 685: Special Topics in Data Mining Jinze Liu

Presentation transcript:

k-Means and DBSCAN Erik Zeitler Uppsala Database Laboratory

Erik Zeitler2 k-Means  Input M (set of points) k (number of clusters)  Output µ 1, …, µ k (cluster centroids)  k-Means clusters the M point into K clusters by minimizing the squared error function clusters S i ; i=1, …, k. µ i is the centroid of all x j  S i.

Erik Zeitler3 k-Means algorithm select (m 1 … m K ) randomly from M % initial centroids do (µ 1 … µ K ) = (m 1 … m K ) all clusters C i = {} for each point p in M % compute cluster membership of p [µ i, i] = min(dist(µ, p)) % assign p to the corresponding cluster: C i = C i  {p} end for each cluster C i % recompute the centroids m i = avg(x in C i ) while exists m i  µ i % convergence criterion

Erik Zeitler4 K-Means on three clusters

Erik Zeitler5 I’m feeling Unlucky Bad initial points

Erik Zeitler6 Eat this! Non-spherical clusters

Erik Zeitler7 k-Means in practice  How to choose initial centroids select randomly among the data points generate completely randomly  How to choose k study the data run k-Means for different k  measure squared error for each k  Run k-Means many times! Get many choices of initial points

Erik Zeitler8 k-Means pros and cons +- Easy Fast Works only for ”well- shaped” clusters Scalable?Sensitive to outliers Sensitive to noise Must know k a priori

Erik Zeitler9 Questions  Euclidean distance results in spherical clusters What cluster shape does the Manhattan distance give? Think of other distance measures too. What cluster shapes will those yield?  Assuming that the K-means algorithm converges in I iterations, with N points and X features for each point give an approximation of the complexity of the algorithm expressed in K, I, N, and X.  Can the K-means algorithm be parallellized? How?

Erik Zeitler10 DBSCAN  Density Based Spatial Clustering of Applications with Noise  Basic idea: If an object p is density connected to q,  then p and q belong to the same cluster If an object is not density connected to any other object  it is considered noise

Erik Zeitler11   -neigborhood The neighborhood within a radius  of an object  core object An object is a core object iff there are more than MinPts objects in its  - neighbourhood  directly density reachable (ddr) An object p is ddr from q iff q is a core object and p is inside the  neighbourhood of q Definitions p q 

Erik Zeitler12  density reachable (dr) An object q is dr from p iff there exists a chain of objects p 1 … p n s.t. - p 1 is ddr from p, - p 2 is ddr from p 1, - p 3 is ddr from … and q is ddr from p n  density connected (dc) p is dc to q iff - exist an object o such that p is dr from o - and q is dr from o Reachability and Connectivity q p p1p1 p2p2 o q p

Erik Zeitler13 Recall…  Basic idea: If an object p is density connected to q,  then p and q belong to the same cluster If an object is not density connected to any other object  it is considered noise

Erik Zeitler14 DBSCAN, take 1 i = 0 do take a point p from M find the set of points P which are density connected to p if P = {} M = M \ {p} else C i =P i=i+1 M = M \ P end while M  {} HOW? p

Erik Zeitler15 DBSCAN, take 2 i = 0 find the set of core points CP in M do take a point p from CP find the set of points P which are density reachable to p if P = {} M = M \ {p} else C i =P i=i+1 M = M \ P end while M  {} HOW? Let’s call this function dr(p) p

Erik Zeitler16 Implement dr create function dr(p) -> P C = {p} P = {p} do remove a point p' from C find all points X that are ddr(p') C = C  (X \ (P  X)) % add newly discovered % points to C P = P  X while C ≠ {} result P p’

Erik Zeitler17 DBSCAN pros and cons +- Clusters of arbitrary shape Robust to noise Requires connected regions of sufficiently high density Does not need an a priori k Data sets with varying densities are problematic Scalable?

Erik Zeitler18 Questions  Why is the dc criterion useful to define a cluster, instead of dr or ddr?  For which points are density reachable symmetric? i.e. for which p, q: dr(p, q) and dr(q, p)?  Express using only core objects and ddr, which objects will belong to a cluster