DATA MINING Introductory and Advanced Topics Part II - Clustering

Presentation transcript:

DATA MINING Introductory and Advanced Topics Part II - Clustering Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Companion slides for the text by Dr. M.H.Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002. Part II - Clustering © Prentice Hall

Clustering Outline
Goal: Provide an overview of the clustering problem and introduce some of the basic algorithms.
Clustering Problem Overview
Hierarchical Algorithms
Partitional Algorithms
Categorical Data
Summary

Clustering Examples
Segment a customer database based on similar buying patterns.
Group houses in a town into neighborhoods based on similar features.
Identify new plant species.
Identify similar Web usage patterns.

Clustering Example [figure]

Clustering Houses [figure with two panels: Size Based, Geographic Distance Based]

Clustering vs. Classification
No prior knowledge of the number of clusters or the meaning of clusters.
Unsupervised learning.

Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability

Impact of Outliers on Clustering [figure]

Clustering Problem
Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f: D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k.
A cluster, Kj, contains precisely those tuples mapped to it.
Unlike in the classification problem, the clusters are not known a priori.

Types of Clustering
Hierarchical: nested set of clusters created.
Partitional: one set of clusters created.
Incremental: each element handled one at a time.
Simultaneous: all elements handled together.
Overlapping or non-overlapping.

Clustering Approaches
Hierarchical: Agglomerative, Divisive
Partitional
Categorical
Large DB: Sampling, Compression

Cluster Parameters

Distance Between Clusters
Single Link: smallest distance between points.
Complete Link: largest distance between points.
Average Link: average distance between points.
Centroid: distance between centroids.
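The four inter-cluster distances listed above can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance between points; the function names and the two sample clusters `K1` and `K2` are hypothetical and do not come from the slides.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_link(a, b):
    """Smallest distance between a point of cluster a and a point of cluster b."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Largest distance between a point of cluster a and a point of cluster b."""
    return max(dist(p, q) for p, q in product(a, b))


def average_link(a, b):
    """Average distance over all cross-cluster pairs of points."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def centroid_distance(a, b):
    """Distance between the centroids of the two clusters."""
    return dist(centroid(a), centroid(b))


# Hypothetical 2-D clusters, chosen only for illustration.
K1 = [(0.0, 0.0), (0.0, 2.0)]
K2 = [(3.0, 0.0), (5.0, 0.0)]

if __name__ == "__main__":
    for name, fn in [("single", single_link), ("complete", complete_link),
                     ("average", average_link), ("centroid", centroid_distance)]:
        print(name, round(fn(K1, K2), 4))
```

The four measures generally disagree on the same pair of clusters, which is why the choice of measure changes which clusters a hierarchical algorithm merges first.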

Hierarchical Clustering
Clusters are created in levels, producing a set of clusters at each level.
Agglomerative (bottom up): initially each item is in its own cluster; clusters are iteratively merged together.
Divisive (top down): initially all items are in one cluster; large clusters are successively divided.

Hierarchical Algorithms
Single Link
MST Single Link
Complete Link
Average Link

Dendrogram
Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
Each level shows the clusters for that level.
Leaf: individual clusters.
Root: one cluster.
A cluster at level i is the union of its children clusters at level i+1.

Levels of Clustering [figure]

Agglomerative Example [figure: distance matrix and graph over items A, B, C, D, E, with a dendrogram drawn against thresholds 1 to 5]

MST Example [figure: distance matrix and graph over items A, B, C, D, E]

Agglomerative Algorithm

Single Link
View all items as nodes in a graph, with links (distances) between them.
Find the maximal connected components in this graph.
Two clusters are merged if there is at least one edge which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.
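The threshold-graph view of single link described above can be sketched as follows. This is an illustrative reading, not the algorithm from the slides: at each threshold only the links no longer than that threshold are kept, and the clusters are the connected components of the resulting graph. The distance table `DIST` is hypothetical sample data, and the union-find helper is one convenient way to compute components.

```python
def connected_components(items, distances, threshold):
    """Clusters = connected components of the graph whose edges have distance <= threshold."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d <= threshold:
            parent[find(a)] = find(b)  # one connecting edge is enough to merge

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


def single_link_levels(items, distances):
    """Agglomerative pass: raise the threshold until a single cluster remains."""
    levels = []
    for t in sorted(set(distances.values())):
        clusters = connected_components(items, distances, t)
        levels.append((t, clusters))
        if len(clusters) == 1:
            break
    return levels


# Hypothetical symmetric distances between five items (illustration only).
ITEMS = ["A", "B", "C", "D", "E"]
DIST = {("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
        ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 3,
        ("C", "D"): 1, ("C", "E"): 5, ("D", "E"): 3}

if __name__ == "__main__":
    for threshold, clusters in single_link_levels(ITEMS, DIST):
        print(threshold, clusters)
```

Each returned level corresponds to one level of a dendrogram; reading the list in reverse gives the divisive (top-down) view.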

MST Single Link Algorithm

Single Link Clustering [figure]

Partitional Clustering
Nonhierarchical.
Creates clusters in one step as opposed to several steps.
Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
Usually deals with static sets.

Partitional Algorithms
MST
Squared Error
K-Means
Nearest Neighbor
PAM
BEA
GA

MST Algorithm

Squared Error
Minimized squared error.
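The formula on this slide did not survive transcription. As a hedged reconstruction, the squared error is usually defined as the sum of squared distances from each item to the mean of its cluster; the symbols below are assumptions that follow the notation used elsewhere in the deck (clusters K_j, items t_ji, cluster means m_j).

```latex
% Squared error of a single cluster K_j with mean m_j
se_{K_j} = \sum_{t_{ji} \in K_j} \lVert t_{ji} - m_j \rVert^{2}

% Squared error of a clustering K = \{K_1, \dots, K_k\}
se_{K} = \sum_{j=1}^{k} se_{K_j}
```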

Squared Error Algorithm

K-Means
Initial set of clusters randomly chosen.
Iteratively, items are moved among sets of clusters until the desired set is reached.
A high degree of similarity among elements in a cluster is obtained.
Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is mi = (1/m)(ti1 + … + tim).

K-Means Example
Given: {2,4,10,12,3,20,30,11,25}, k=2
Randomly assign means: m1=3, m2=4
K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5, m2=16
K1={2,3,4}, K2={10,12,20,30,11,25}, m1=3, m2=18
K1={2,3,4,10}, K2={12,20,30,11,25}, m1=4.75, m2=19.6
K1={2,3,4,10,11,12}, K2={20,30,25}, m1=7, m2=25
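The iterations in the example above can be reproduced with a short one-dimensional K-Means sketch. This is a minimal illustration rather than the algorithm as printed in the text: the function name `kmeans_1d`, the `history` return value, and the stopping rule (stop when the means no longer change) are assumptions. The initial means 3 and 4 are taken from the slide instead of being drawn at random.

```python
def kmeans_1d(data, means, max_iter=100):
    """One-dimensional K-Means: assign each item to the nearest mean, then recompute means."""
    means = list(means)
    history = []
    clusters = [[] for _ in means]
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in data:
            # Index of the closest mean; ties go to the lower index.
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous mean.
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        history.append((clusters, new_means))
        if new_means == means:  # no mean moved, so the assignment is stable
            break
        means = new_means
    return clusters, means, history


DATA = [2, 4, 10, 12, 3, 20, 30, 11, 25]

if __name__ == "__main__":
    final_clusters, final_means, steps = kmeans_1d(DATA, [3, 4])
    for clusters, means in steps:
        print([sorted(c) for c in clusters], means)
```

With the slide's data, the first four passes produce the four cluster sets listed above, and a fifth pass confirms that nothing changes.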

K-Means Algorithm

Nearest Neighbor
Items are iteratively merged into the existing clusters that are closest.
Incremental.
A threshold, t, is used to determine whether an item is added to an existing cluster or a new cluster is created.
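A minimal sketch of the incremental procedure described above, under stated assumptions: items are numbers, the distance is the absolute difference, and an item joins the cluster of its closest already-placed item when that distance is at most t. The function name and the threshold value 3 are hypothetical; the data set is borrowed from the K-Means example.

```python
def nearest_neighbor_clustering(items, t):
    """Place items one at a time; open a new cluster when the nearest placed item is farther than t."""
    clusters = []
    for x in items:
        if not clusters:
            clusters.append([x])  # the first item starts the first cluster
            continue
        # Distance to the closest already-clustered item, and the index of its cluster.
        best_distance, best_index = min(
            (abs(x - y), idx) for idx, c in enumerate(clusters) for y in c
        )
        if best_distance <= t:
            clusters[best_index].append(x)
        else:
            clusters.append([x])
    return clusters


if __name__ == "__main__":
    print(nearest_neighbor_clustering([2, 4, 10, 12, 3, 20, 30, 11, 25], t=3))
```

Because items are handled one at a time, the result can depend on the input order, and the number of clusters is determined by t rather than supplied in advance.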

Nearest Neighbor Algorithm

PAM
Partitioning Around Medoids (PAM), also called K-Medoids.
Handles outliers well.
Ordering of input does not impact results.
Does not scale well.
Each cluster is represented by one item, called the medoid.
Initial set of k medoids randomly chosen.

PAM [figure]

PAM Cost Calculation
At each step in the algorithm, medoids are changed if the overall cost is improved.
Cjih: cost change for an item tj associated with swapping medoid ti with non-medoid th.
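The swap test described above can be sketched as follows. The slide's own formula is not in the transcript, so this assumes the usual formulation: the cost of a clustering is the sum of distances from each item to its nearest medoid, Cjih is the change in that distance for item tj, and the total change for a swap is the sum of Cjih over all items. The function names and the sample medoids are hypothetical.

```python
def nearest_medoid_distance(x, medoids):
    """Distance from item x to its closest medoid (absolute difference for numbers)."""
    return min(abs(x - m) for m in medoids)


def item_cost_change(t_j, medoids, t_i, t_h):
    """C_jih: change in cost for item t_j when medoid t_i is swapped with non-medoid t_h."""
    new_medoids = [t_h if m == t_i else m for m in medoids]
    return nearest_medoid_distance(t_j, new_medoids) - nearest_medoid_distance(t_j, medoids)


def total_swap_cost(items, medoids, t_i, t_h):
    """Sum of C_jih over all items; a negative value means the swap improves the clustering."""
    return sum(item_cost_change(t_j, medoids, t_i, t_h) for t_j in items)


ITEMS = [2, 3, 4, 10, 11, 12, 20, 25, 30]
MEDOIDS = [3, 4]  # hypothetical starting medoids

if __name__ == "__main__":
    # Swapping medoid 4 for non-medoid 20 lowers the total cost, so PAM would accept it.
    print(total_swap_cost(ITEMS, MEDOIDS, t_i=4, t_h=20))
```

Evaluating every medoid and non-medoid pair in this way at each step is what makes PAM expensive on large data sets.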

PAM Algorithm

BEA
Bond Energy Algorithm.
Database design (physical and logical).
Vertical fragmentation.
Determine affinity (bond) between attributes based on common usage.
Algorithm outline:
Create affinity matrix.
Convert to BOND matrix.
Create regions of close bonding.

BEA [figure]

Genetic Algorithm Example
Items: {A,B,C,D,E,F,G,H}
Randomly choose initial solution: {A,C,E} {B,F} {D,G,H}, or 10101000, 01000100, 00010011.
Suppose crossover at point four and choose 1st and 3rd individuals: 10100011, 01000100, 00011000.
What should the termination criteria be?
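The bit-string encoding and the crossover step in the example above can be sketched as follows. The helper names are hypothetical; the encoding (one bit per item, 1 meaning the item belongs to the cluster) is read off the slide's own example.

```python
ITEMS = "ABCDEFGH"


def encode(cluster):
    """Bit string with a 1 at each position whose item belongs to the cluster."""
    return "".join("1" if item in cluster else "0" for item in ITEMS)


def decode(bits):
    """Set of items whose bit is 1."""
    return {item for item, bit in zip(ITEMS, bits) if bit == "1"}


def crossover(parent_a, parent_b, point):
    """Single-point crossover: swap the tails of the two bit strings after `point` bits."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


if __name__ == "__main__":
    population = [encode("ACE"), encode("BF"), encode("DGH")]
    first, third = crossover(population[0], population[2], point=4)
    new_population = [first, population[1], third]
    print(new_population)
    print([sorted(decode(bits)) for bits in new_population])
```

In this example the offspring still partition the eight items, but single-point crossover does not guarantee that in general, so a fuller implementation would need a repair step or a fitness penalty for invalid solutions.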

GA Algorithm

Clustering Large DB
Most clustering algorithms assume a large data structure which is memory resident.
Clustering may be performed first on a sample of the database and then applied to the entire database.
Algorithms: BIRCH, DBSCAN, CURE