Clustering. Cluster: a number of things of the same kind being close together in a group (Longman Dictionary of Contemporary English). CS240B lecture notes.

Presentation transcript:

Clustering. Cluster: a number of things of the same kind being close together in a group (Longman Dictionary of Contemporary English). CS240B lecture notes by C. Zaniolo.

Example: Customer Segmentation. Given: a large database of customer data containing their properties and past buying records. Find groups of customers with similar behavior (clusters). Find customers with unusual behavior (outliers).

Problem Definition. Given: a set of N items in D dimensions. Find: a natural partitioning of the data set into a number of clusters (k) plus outliers, such that items in the same cluster are similar (intra-cluster similarity is maximized) and items from different clusters are different (inter-cluster similarity is minimized). No predefined classes: unsupervised learning. Used either as a stand-alone tool to gain insight into the data distribution or as a preprocessing step for other algorithms.

Data Mining: Concepts and Techniques — Chapter 7 — These slides are based on those downloaded from www.cs.uiuc.edu/~hanj Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign ©2006 Jiawei Han and Micheline Kamber

Clustering: Rich Applications and Multidisciplinary Efforts. Pattern recognition. Spatial data analysis: create thematic maps in GIS by clustering feature spaces; detect spatial clusters, or use clustering for other spatial mining tasks. Image processing. Economic science (especially market research). WWW: document classification; cluster Weblog data to discover groups of similar access patterns.

Examples of Clustering Applications. Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs. Land use: identification of areas of similar land use in an earth observation database. Insurance: identifying groups of motor insurance policy holders with a high average claim cost. City planning: identifying groups of houses according to their house type, value, and geographical location. Earthquake studies: observed earthquake epicenters should be clustered along continental faults.

K-Means
K-means (MacQueen, 1967) is one of the simplest clustering algorithms; it minimizes the distance of the objects from their cluster centers.
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
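The four steps above can be sketched in plain Python. This is a minimal illustration, not the lecture's own code: the function name `kmeans`, the parameters `init`, `max_iter` and `seed`, the choice of Euclidean distance, and the rule that an empty group keeps its old centroid are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    `init` is an optional list of k starting centroids (hypothetical
    parameter); when omitted, k of the objects are sampled at random.
    """
    rng = random.Random(seed)
    # Step 1: place K points in the space of the objects (initial centroids).
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean of its group.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumed convention: an empty group keeps its old centroid.
                new_centroids.append(tuple(centroids[c]))
        # Step 4: repeat until the centroids no longer move.
        if new_centroids == [tuple(c) for c in centroids]:
            break
        centroids = new_centroids
    return centroids, labels


# Made-up data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, 2, init=[(0.0, 0.0), (10.0, 10.0)])
```

With random initialization the result can depend on the starting centroids, so in practice the procedure is often run several times with different seeds.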

K-means example, step 1: Pick 3 initial cluster centers k1, k2, k3 (randomly). [Figure: points in the X-Y plane.]

K-means example, step 2: Assign each point to the closest cluster center.

K-means example, step 3: Move each cluster center to the mean of its cluster.

K-means example, step 4: Reassign the points that are now closest to a different cluster center. Q: Which points are reassigned?

K-means example, step 4 (continued): Reassign points to the closest center. The reassigned points are shown in the slide figure.

K-means example, step 5: Re-compute the cluster means.

K-means example, step 6: Reassign points to clusters. No change: the end.

K-means clustering summary. Advantages: simple and understandable; items are automatically assigned to clusters. Disadvantages: the number of clusters must be picked beforehand; all items are forced into a cluster; too sensitive to outliers.

Similarity and Distance. K-means and all other clustering methods group together the most similar objects, where some notion of distance is used to define similarity: close-by means similar, far apart means dissimilar. Distance is obvious in our XY planes, but not so obvious in general: categorical, boolean, vectors, etc.

Dissimilarity between items is expressed by their distance. Data matrix: no assumption. Dissimilarity matrix: typically a symmetric matrix of pairwise distances d(i,j).

Types of data in cluster analysis: interval-scaled variables; binary variables; nominal, ordinal, and ratio variables; variables of mixed types.

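Interval-Scaled Variables. Interval-scaled variables are continuous measurements on a roughly linear scale (e.g., temperature, weight, coordinates), which are then assumed to range over an interval. Notion of distance between two vectors $X = \langle x_1, \dots, x_n \rangle$ and $Y = \langle y_1, \dots, y_n \rangle$: $d(X,Y) = \left(|x_1 - y_1|^q + \dots + |x_n - y_n|^q\right)^{1/q}$. This is the Minkowski distance, defined for any $q \ge 1$; q=2 gives the Euclidean distance, q=1 the Manhattan distance, and values such as 1<q<2 lie in between.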
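As a small illustration of the distance formula above, here is a sketch in Python; the function name `minkowski` and the sample vectors are made up for the example.

```python
def minkowski(x, y, q):
    """Distance (|x1-y1|^q + ... + |xn-yn|^q)^(1/q) between two vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Made-up 2-D points.
X = (0.0, 0.0)
Y = (3.0, 4.0)
euclidean = minkowski(X, Y, 2)    # q = 2
manhattan = minkowski(X, Y, 1)    # q = 1
in_between = minkowski(X, Y, 1.5)  # 1 < q < 2
```

For these two points the value shrinks as q grows: the Manhattan distance is the largest of the three and the Euclidean distance the smallest.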

Metric Properties. The following are satisfied by all three previous distances: $d(i,j) \ge 0$; $d(i,i) = 0$; $d(i,j) = d(j,i)$; $d(i,j) \le d(i,k) + d(k,j)$.

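Heterogeneous Variables. Standardization is needed. E.g., given n values $x_1, \dots, x_n$ for a variable x: calculate the mean absolute deviation w.r.t. the mean, $s = \frac{1}{n}\sum_{i=1}^{n} |x_i - m|$, where $m = \frac{1}{n}\sum_{i=1}^{n} x_i$; then calculate the standardized measurement (z-score), $z_i = (x_i - m)/s$. Using the mean absolute deviation is more robust than using the standard deviation.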
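A minimal sketch of this standardization, assuming the usual definitions (mean of the values, mean absolute deviation around that mean, and the deviation-scaled z-score). The function name `standardize` and the sample values are illustrative only.

```python
def standardize(values):
    """Return z-scores scaled by the mean absolute deviation."""
    n = len(values)
    mean = sum(values) / n
    # Mean absolute deviation with respect to the mean.
    mad = sum(abs(v - mean) for v in values) / n
    if mad == 0:
        raise ValueError("all values are identical; nothing to scale")
    return [(v - mean) / mad for v in values]


# Made-up measurements of one variable.
z = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because the deviations are not squared, a single extreme value inflates the scale factor less than it would inflate the standard deviation, so outliers remain more visible after scaling.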

Dissimilarity between Binary Variables, Example: gender is a symmetric attribute; the remaining attributes are asymmetric binary (0 denotes the normal condition). Let the values Y and P be set to 1, and the value N be set to 0.

Binary Variables (vector of size p). A contingency table for binary data counts, over the p variables of Object i and Object j: a = number of variables where both are 1, b = where i is 1 and j is 0, c = where i is 0 and j is 1, d = where both are 0 (so a + b + c + d = p). Distance measure for symmetric binary variables: $d(i,j) = \frac{b + c}{a + b + c + d}$. Jaccard coefficient (similarity measure for asymmetric binary variables): $sim(i,j) = \frac{a}{a + b + c}$. Distance measure for asymmetric binary variables: $d(i,j) = 1 - sim(i,j) = \frac{b + c}{a + b + c}$.
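The binary measures can be sketched as follows. The counts a, b, c, d (both 1; i is 1 and j is 0; i is 0 and j is 1; both 0) follow the usual contingency-table convention and are an assumption here, as are the function names and the two made-up objects.

```python
def contingency(i, j):
    """Count the four 0/1 combinations over two equal-length binary vectors."""
    a = sum(1 for u, v in zip(i, j) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(i, j) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(i, j) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(i, j) if u == 0 and v == 0)
    return a, b, c, d


def symmetric_distance(i, j):
    """Mismatches over all variables: 0-0 and 1-1 matches count equally."""
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def jaccard_similarity(i, j):
    """Shared 1s over variables where at least one object is 1."""
    a, b, c, _ = contingency(i, j)
    return a / (a + b + c)  # assumes at least one 1 in i or j


def asymmetric_distance(i, j):
    """1 - Jaccard similarity; 0-0 matches are ignored."""
    a, b, c, _ = contingency(i, j)
    return (b + c) / (a + b + c)  # assumes at least one 1 in i or j


# Two made-up objects described by six asymmetric binary attributes.
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
```

Here the symmetric distance is smaller than the asymmetric one, because the three 0-0 matches count as agreement only in the symmetric case.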

Dissimilarity between Binary Variables, Example: gender is a symmetric attribute; the remaining attributes are asymmetric binary. The dissimilarity is computed for the asymmetric attributes only.

Categorical Variables. A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green. Method 1: simple matching, with m = # of matches and p = total # of variables: $d(i,j) = \frac{p - m}{p}$. Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states.
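Both methods can be sketched briefly. The simple-matching distance is taken here as the fraction of variables that do not match, (p - m)/p, which is the standard form and an assumption about the formula on the original slide; the function names and sample objects are made up.

```python
def simple_matching_distance(i, j):
    """Method 1: fraction of variables on which the two objects differ."""
    p = len(i)                                   # total number of variables
    m = sum(1 for u, v in zip(i, j) if u == v)   # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one new binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]


# Two made-up objects with three categorical variables each.
obj_i = ["red", "small", "round"]
obj_j = ["red", "large", "round"]
dist = simple_matching_distance(obj_i, obj_j)

colors = ["red", "yellow", "blue", "green"]
encoded = one_hot("blue", colors)
```

After the encoding of Method 2, the binary-variable measures can be applied to the resulting 0/1 vectors.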

Ordinal Variables. An ordinal variable can be discrete or continuous. Order is important, e.g., rank. It can be treated like an interval-scaled variable: (1) replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$; (2) map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$; (3) compute the dissimilarity using methods for interval-scaled variables.
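A short sketch of the rank mapping, assuming the common normalization (rank - 1)/(M - 1) for a variable with M ordered states. The function name `ordinal_to_unit_interval` and the medal example are hypothetical.

```python
def ordinal_to_unit_interval(values, order):
    """Map ordinal values onto [0, 1] via their rank in `order`.

    `order` lists the M states from lowest to highest (M >= 2).
    """
    rank = {state: r for r, state in enumerate(order, start=1)}
    m = len(order)
    return [(rank[v] - 1) / (m - 1) for v in values]


# Made-up ordinal variable with three ordered states.
order = ["bronze", "silver", "gold"]
scaled = ordinal_to_unit_interval(["gold", "bronze", "silver"], order)
```

The scaled values can then be fed to any interval-scaled distance, such as the Manhattan or Euclidean distance.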

Ratio-Scaled Variables. Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$. Methods: (1) treat them like interval-scaled variables; not a good choice (why? the scale can be distorted); (2) apply a logarithmic transformation, $y_{if} = \log(x_{if})$; (3) treat them as continuous ordinal data and treat their rank as interval-scaled.
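To see why the logarithmic transformation helps, here is a short derivation under the assumption that the measurement follows the exponential form exactly, with A > 0:

```latex
x = A e^{Bt}
\quad\Longrightarrow\quad
y = \log x = \log A + B t
```

After the transformation y is linear in t, so equal steps in t give equal steps in y, and the methods for interval-scaled variables apply without the distortion of the original scale. The decaying form $Ae^{-Bt}$ works the same way with slope -B.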

Combining Variables of Mixed Types. Bring all the variables onto a common scale, typically ranging between 0 and 1.

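Vector Objects. Vector objects: keywords in documents, gene features in micro-arrays, etc. Broad applications: information retrieval, biological taxonomy, etc. Cosine measure: $s(X,Y) = \frac{X \cdot Y}{|X|\,|Y|}$. A variant: Tanimoto coefficient (for binary vectors), $s(X,Y) = \frac{X \cdot Y}{X \cdot X + Y \cdot Y - X \cdot Y}$.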
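The two vector measures can be sketched as below, assuming their standard definitions: cosine as the dot product divided by the product of the vector lengths, and Tanimoto as the dot product divided by (X·X + Y·Y - X·Y). The function names and the keyword vectors are made up for illustration.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Cosine of the angle between x and y (both must be non-zero)."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def tanimoto(x, y):
    """Tanimoto coefficient: shared weight over combined weight."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


# Made-up binary keyword vectors for two documents.
doc_x = [1, 1, 0, 1]
doc_y = [1, 0, 0, 1]
cos_xy = cosine(doc_x, doc_y)
tan_xy = tanimoto(doc_x, doc_y)
```

For binary vectors the Tanimoto coefficient equals the number of shared keywords divided by the number of keywords present in at least one document, which matches the Jaccard coefficient.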