Presentation transcript:

1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis)
Clustering
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business

2 Introduction
Clustering
– Groups objects without pre-specified class labels into a set of non-predetermined classes of similar objects
[Figure: objects O1–O6, each described by relevant attribute values without class labels, are grouped by clustering into non-predetermined classes X, Y and Z]

3 An example We can cluster customers based on their purchase behavior.

4 Applications
For discovery
– Customers by shopping behavior, credit rating and/or demographics
– Insurance policy holders
– Plants, animals, genes, protein structures
– Handwriting
– Images
– Drawings
– Land uses
– Documents
– Web pages
For pre-processing – data segmentation and outlier analysis
For conceptual clustering – traditional clustering + classification/characterization to describe each cluster

5 Basic Terminology
Cluster – a collection of objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
Distance measure – how dissimilar (similar) objects are:
– Non-negative
– Distance between identical objects = 0
– Symmetric
– Triangle inequality: the distance between two objects A and B is no greater than the distance from A to a third object C plus the distance from C to B

6 Clustering Process
– Compute similarity between objects/clusters
– Cluster based on the similarity between objects/clusters

7 Similarity/Dissimilarity
An object (e.g., a customer) is described by a list of variables (e.g., attributes of a customer such as age, spending, gender, etc.).
To measure similarity between objects, we compare the values of their variables. In practice, rather than measuring similarity directly, we usually measure dissimilarity between objects as a distance over their variables.

8 Similarity/Dissimilarity
Continuous variables:
– Manhattan distance
– Euclidean distance

9 Dissimilarity
For two objects X and Y with continuous variables 1, 2, …, n, the Manhattan distance is defined as:
d(X, Y) = |x1 – y1| + |x2 – y2| + … + |xn – yn|

10 Dissimilarity
Example of Manhattan distance
[Table: NAME, AGE and SPENDING($) values for customers Sue, Carl, Tom and Jack]
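As a minimal illustration of the Manhattan distance above, here is a short Python sketch; the customer values are made up for the example, since the slide's table values are not shown in full.

```python
# Minimal sketch of the Manhattan (city-block) distance from slide 9,
# using hypothetical AGE and SPENDING values for two customers.

def manhattan_distance(x, y):
    """Sum of absolute differences over all continuous variables."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

sue = [25, 5500]    # [age, spending($)] -- hypothetical values
jack = [52, 6000]

print(manhattan_distance(sue, jack))   # |25-52| + |5500-6000| = 527
```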

11 Dissimilarity
For two objects X and Y with continuous variables 1, 2, …, n, the Euclidean distance is defined as:
d(X, Y) = square root((x1 – y1)^2 + (x2 – y2)^2 + … + (xn – yn)^2)

12 Dissimilarity
Example of Euclidean distance
[Table: NAME, AGE and SPENDING($) values for customers Sue, Carl, Tom and Jack]
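A matching sketch for the Euclidean distance, reusing the same hypothetical customer vectors as in the Manhattan example:

```python
# Minimal sketch of the Euclidean distance from slide 11.
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

sue = [25, 5500]    # [age, spending($)] -- hypothetical values
jack = [52, 6000]

print(euclidean_distance(sue, jack))   # sqrt(27**2 + 500**2) ≈ 500.73
```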

13 Similarity/Dissimilarity
Binary variables
Normalized Manhattan distance = number of unmatched variables / total number of variables

NAME  Married  Gender  Home Internet
Sue   Y        F       Y
Carl  Y        M       Y
Tom   N        M       N
Jack  N        M       N
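A small sketch of the normalized Manhattan distance for the table above; the rows come from the slide, while the function name is illustrative.

```python
# Normalized Manhattan distance for binary/categorical variables (slide 13):
# the fraction of variables whose values do not match.

def normalized_manhattan(x, y):
    """Number of unmatched variables divided by the total number of variables."""
    mismatches = sum(1 for xi, yi in zip(x, y) if xi != yi)
    return mismatches / len(x)

# Rows from the slide's table: (Married, Gender, Home Internet)
sue  = ("Y", "F", "Y")
carl = ("Y", "M", "Y")
tom  = ("N", "M", "N")

print(normalized_manhattan(sue, carl))  # 1/3 -- only Gender differs
print(normalized_manhattan(sue, tom))   # 3/3 -- all three variables differ
print(normalized_manhattan(carl, tom))  # 2/3
```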

14 Similarity/Dissimilarity
Nominal/ordinal variables

NAME   AGE  BALANCE($)  INCOME  EYES   GENDER
Karen  21   2300        high    Blue   F
Sue    21   2300        high    Blue   F
Carl   27   5400        high    Brown  M

We assign 0/1 based on exact-match criteria:
– Same gender = 0, different gender = 1
– Same eye color = 0, different eye color = 1
We can also "rank" an ordinal attribute:
– income: high = 3, medium = 2, low = 1
– E.g., distance(high, low) = 2

15 Distance Calculation

NAME  AGE  BALANCE($)  INCOME  EYES   GENDER
Sue   21   2300        high    Blue   F
Carl  27   5400        high    Brown  M

Manhattan Difference: |21 – 27| + |2300 – 5400| + 0 + 1 + 1 = 3108
Euclidean Difference: square root(6^2 + 3100^2 + 0^2 + 1^2 + 1^2) ≈ 3100
Is there a problem?
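The mixed-variable calculation on slides 14-15 can be sketched as follows. The encoding (0/1 exact match for nominal variables, rank difference for income, absolute difference for continuous variables) follows the slides, and the total reproduces the 3108 Manhattan figure above; the function and dictionary names are illustrative.

```python
# Manhattan-style distance over mixed variable types (slides 14-15).

INCOME_RANK = {"low": 1, "medium": 2, "high": 3}

def mixed_distance(a, b):
    """Distance over (age, balance, income, eyes, gender)."""
    d = abs(a["age"] - b["age"])                                    # continuous
    d += abs(a["balance"] - b["balance"])                           # continuous
    d += abs(INCOME_RANK[a["income"]] - INCOME_RANK[b["income"]])   # ordinal
    d += 0 if a["eyes"] == b["eyes"] else 1                         # nominal
    d += 0 if a["gender"] == b["gender"] else 1                     # nominal
    return d

sue  = {"age": 21, "balance": 2300, "income": "high", "eyes": "Blue",  "gender": "F"}
carl = {"age": 27, "balance": 5400, "income": "high", "eyes": "Brown", "gender": "M"}

print(mixed_distance(sue, carl))   # 6 + 3100 + 0 + 1 + 1 = 3108
```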

16 Normalization
Normalization of dimension values:
– In the previous example, "balance" is dominant
– Set the minimum and maximum distance values for each dimension to be the same (e.g., …)

NAME  AGE  BALANCE($)  INCOME  EYES   GENDER
Sue   21   2300        high    Blue   F
Carl  27   5400        high    Brown  M
Don   18   0           low     Black  M
Amy   62   16,543      low     Blue   F

Assume that ages range from 18 to 62. The balance values range from 0 to 16,543, so the normalized balance term of the Manhattan difference between Sue and Carl is (5400 – 2300)/16,543 instead of the raw difference of 3100.
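A minimal sketch of min-max normalization for the table above, assuming each dimension is rescaled to the range 0 to 1 before distances are computed; the values for Don and Amy are as reconstructed in the table, and the function name is illustrative.

```python
# Min-max normalization (slide 16): rescale each dimension so that a
# large-valued dimension such as BALANCE($) does not dominate AGE.

def min_max_normalize(values):
    """Map a list of values for one dimension onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages     = [21, 27, 18, 62]          # Sue, Carl, Don, Amy
balances = [2300, 5400, 0, 16543]

norm_ages     = min_max_normalize(ages)
norm_balances = min_max_normalize(balances)

# Normalized Manhattan distance between Sue (index 0) and Carl (index 1)
d = abs(norm_ages[0] - norm_ages[1]) + abs(norm_balances[0] - norm_balances[1])
print(round(d, 3))   # ≈ 0.136 + 0.187 = 0.324
```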

17 Standardization
– Calculate the mean value of each variable
– Calculate the mean absolute deviation
– Standardize each variable value as:
Standardized value = (original value – mean value) / mean absolute deviation
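A short sketch of this standardization, using the balance column from the previous slide as an example; note that it divides by the mean absolute deviation rather than the standard deviation.

```python
# Standardization with the mean absolute deviation (slide 17).

def standardize(values):
    """(value - mean) / mean absolute deviation for one variable."""
    mean = sum(values) / len(values)
    mad = sum(abs(v - mean) for v in values) / len(values)
    return [(v - mean) / mad for v in values]

balances = [2300, 5400, 0, 16543]     # same balance column as the normalization example
print([round(z, 2) for z in standardize(balances)])
# mean = 6060.75, mean absolute deviation = 5241.125
# -> [-0.72, -0.13, -1.16, 2.0]
```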

18 Hierarchical Algorithms
Output: a tree of clusters in which a parent node (cluster) consists of the objects in its child nodes (clusters)
Input: objects and a distance measure only; no need for a pre-specified number of clusters
Agglomerative hierarchical clustering:
– Bottom-up
– Leaf nodes are individual objects
– Merges lower-level clusters by optimizing a clustering criterion until the termination conditions are satisfied
– More popular

19 Hierarchical Algorithms
Output: a tree of clusters in which a parent node (cluster) consists of the objects in its child nodes (clusters)
Input: objects and a distance measure only; no need for a pre-specified number of clusters
Divisive hierarchical clustering:
– Top-down
– The root node corresponds to the whole set of objects
– Subdivides a cluster into smaller clusters by optimizing a clustering criterion until the termination conditions are met
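For a concrete, library-specific sketch: SciPy's hierarchical-clustering routines implement the agglomerative (bottom-up) approach described above. The example below is an assumption about tooling, not the package used in the course, and the data points are hypothetical.

```python
# Agglomerative hierarchical clustering with SciPy: build the full cluster
# tree from pairwise dissimilarities, then cut it into k clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy objects: [age, spending] already normalized to comparable scales
X = np.array([[0.1, 0.2], [0.15, 0.25], [0.8, 0.9], [0.75, 0.85], [0.5, 0.1]])

dists = pdist(X, metric="cityblock")       # pairwise Manhattan dissimilarities
tree = linkage(dists, method="single")     # "single" = min distance between clusters

labels = fcluster(tree, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)                               # e.g., [1 1 2 2 3]
```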

20 Clustering based on dissimilarity
After calculating the dissimilarity between objects, a dissimilarity matrix can be created with objects as indexes and the pairwise dissimilarities as elements.
Distance between clusters:
– Min, Max, Mean and Average
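These inter-cluster distances can be sketched directly from a dissimilarity matrix: "min" and "max" use the closest and farthest pair of objects, and "average" averages all cross-pairs. The matrix values and function names below are hypothetical; the "mean" (centroid-based) distance needs the raw variable values, so it is omitted.

```python
# Cluster-to-cluster distances computed from a dissimilarity matrix D,
# where D[i][j] is the dissimilarity between objects i and j (slide 20).

def min_linkage(D, c1, c2):
    return min(D[i][j] for i in c1 for j in c2)

def max_linkage(D, c1, c2):
    return max(D[i][j] for i in c1 for j in c2)

def average_linkage(D, c1, c2):
    return sum(D[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

# Hypothetical 4-object dissimilarity matrix (symmetric, zero diagonal)
D = [[0, 2, 6, 10],
     [2, 0, 5,  9],
     [6, 5, 0,  4],
     [10, 9, 4, 0]]

c1, c2 = [0, 1], [2, 3]
print(min_linkage(D, c1, c2), max_linkage(D, c1, c2), average_linkage(D, c1, c2))
# -> 5 10 7.5
```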

21 Clustering based on dissimilarity
[Example: a 5 × 5 dissimilarity matrix indexed by Sue, Tom, Carl, Jack and Mary]

22 Bottom-up Hierarchical Clustering
Step 1: Initially, place each object in its own cluster
Step 2: Calculate the dissimilarity between clusters, where the dissimilarity between two clusters is the minimum dissimilarity between two objects, one from each cluster
Step 3: Merge the two clusters with the least dissimilarity
Step 4: Repeat steps 2-3 until all objects are in one cluster
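A compact sketch of this bottom-up loop over a small hypothetical dissimilarity matrix, using the minimum-dissimilarity rule from step 2; the function name and matrix values are illustrative.

```python
# Bottom-up (agglomerative) clustering loop from slide 22: start with
# singleton clusters and repeatedly merge the two closest clusters.

def agglomerate(D):
    """D is a symmetric dissimilarity matrix; returns the list of merges."""
    clusters = [[i] for i in range(len(D))]        # step 1: one object per cluster
    merges = []
    while len(clusters) > 1:
        # step 2: dissimilarity between clusters = closest pair of members
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best                              # step 3: merge the closest pair
        merges.append((clusters[a], clusters[b], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [clusters[a] + clusters[b]]
    return merges                                   # step 4: repeated until one cluster remains

D = [[0, 2, 6, 10],
     [2, 0, 5,  9],
     [6, 5, 0,  4],
     [10, 9, 4, 0]]
for merge in agglomerate(D):
    print(merge)    # ([0], [1], 2), then ([2], [3], 4), then ([0, 1], [2, 3], 5)
```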

23 Nearest Neighbor Clustering (Demographic Clustering)
– Dissimilarity determined by votes
– Merge an object into the cluster with the lowest average dissimilarity
– If the average dissimilarity with every cluster exceeds a threshold, the object forms its own cluster
– Stop after a maximum number of passes, a maximum number of clusters, or when there is no significant change in the average dissimilarities within each cluster
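A rough single-pass sketch of this assignment rule, with a simple numeric dissimilarity standing in for the vote-based measure described on the slide; the function name, data and threshold are illustrative, and the multi-pass stopping rules are left out.

```python
# One pass of the nearest-neighbor / demographic assignment rule (slide 23):
# each object joins the cluster with the lowest average dissimilarity, or
# starts a new cluster if every average exceeds the threshold.

def demographic_pass(objects, dissimilarity, threshold):
    clusters = []                                   # each cluster is a list of objects
    for obj in objects:
        best_cluster, best_avg = None, None
        for cluster in clusters:
            avg = sum(dissimilarity(obj, member) for member in cluster) / len(cluster)
            if best_avg is None or avg < best_avg:
                best_cluster, best_avg = cluster, avg
        if best_cluster is not None and best_avg <= threshold:
            best_cluster.append(obj)                # join the most similar cluster
        else:
            clusters.append([obj])                  # too dissimilar: start a new cluster
    return clusters

# Example with 1-D objects and absolute difference as the dissimilarity
points = [1.0, 1.2, 8.0, 8.3, 1.1, 25.0]
print(demographic_pass(points, lambda a, b: abs(a - b), threshold=2.0))
# -> [[1.0, 1.2, 1.1], [8.0, 8.3], [25.0]]
```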

24 Comparative Criteria for Clustering Algorithms
– Performance
– Scalability
– Ability to deal with different attribute types
– Clusters with arbitrary shape
– Need for a pre-specified number of clusters (K) or not
– Noise handling
– Sensitivity to the order of input records
– High dimensionality (# of attributes)
– Constraint-based clustering
– Interpretability and usability

25 Summary
Problem definition
– Input: objects without class labels
– Output: clusters for discovery and conceptual clustering for prediction
Similarity/dissimilarity measures and calculations
Hierarchical clustering
Criteria for comparing algorithms
Readings – T2, pp. 335 – 344 and