1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) Clustering Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential Chair of Business

2 Introduction Clustering – Groups objects without pre-specified class labels into a set of non-predetermined classes of similar objects. [Figure: clustering groups the unlabeled objects O1 to O6 into Class X, Class Y and Class Z.] Oi: contains relevant attribute values without class labels. Classes X, Y or Z: non-predetermined.

3 An example We can cluster customers based on their purchase behavior.

4 Applications For discovery –Customers by shopping behavior, credit rating and/or demographics –Insurance policy holders –Plants, animals, genes, protein structures –Handwriting –Images –Drawings –Land uses –Documents –Web pages For pre-processing – data segmentation and outlier analysis For conceptual clustering – traditional clustering + classification/characterization to describe each cluster

5 Basic Terminology Cluster – a collection of objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Distance measure – how dissimilar (similar) objects are –Non-negative –Distance from an object to itself = 0 –Symmetric –Triangle inequality: the distance between two objects, A and B, is no larger than the sum of the distance from A to another object C and the distance from C to B
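The four properties of a distance measure can be checked mechanically on a sample of objects. The following is a minimal sketch that is not part of the original slides; the helper name `satisfies_distance_properties` and the sample points are illustrative assumptions.

```python
from itertools import product

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def satisfies_distance_properties(dist, points, tol=1e-9):
    """Check the four properties on every combination of the sample points."""
    for a, b in product(points, repeat=2):
        if dist(a, b) < 0:                       # non-negative
            return False
        if abs(dist(a, b) - dist(b, a)) > tol:   # symmetric
            return False
    if any(abs(dist(a, a)) > tol for a in points):   # zero self-distance
        return False
    for a, b, c in product(points, repeat=3):
        if dist(a, b) > dist(a, c) + dist(c, b) + tol:   # triangle inequality
            return False
    return True

# Hypothetical sample points, chosen only for illustration.
points = [(0, 0), (3, 4), (1, 7), (6, 2)]
print(satisfies_distance_properties(manhattan, points))   # True

# Squared difference breaks the triangle inequality: d(0,2)=4 > d(0,1)+d(1,2)=2.
squared = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
print(satisfies_distance_properties(squared, [(0,), (1,), (2,)]))   # False
```

Passing the check on a sample does not prove that a function is a distance measure in general; a single failure, however, is enough to rule it out.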

6 Clustering Process Compute similarity between objects/clusters Clustering based on similarity between objects/clusters

7 Similarity/Dissimilarity An object (e.g., a customer) has a list of variables (e.g., attributes of a customer such as age, spending, gender etc.) Measuring similarity between objects comes down to comparing the variables of those objects. In practice, rather than measuring similarity directly, we use a distance to measure the dissimilarity between variables.

8 Similarity/Dissimilarity Continuous variables Manhattan distance Euclidean distance

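9 Dissimilarity For two objects X and Y with continuous variables 1, 2, …, n, Manhattan distance is defined as: $d(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$, where $x_i$ and $y_i$ are the values of variable i for X and Y.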

10 Dissimilarity Example of Manhattan distance

NAME  AGE  SPENDING($)
Sue   21   2300
Carl  27   2600
Tom   45   5400
Jack  52   6000
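The slide gives the data but not the computed distances. As a small sketch (the `customers` dictionary and the function name are illustrative choices, not from the slides), the Manhattan distance between any two customers in the table can be computed as:

```python
# Age and spending values as read from the table on this slide.
customers = {
    "Sue": (21, 2300),
    "Carl": (27, 2600),
    "Tom": (45, 5400),
    "Jack": (52, 6000),
}

def manhattan(x, y):
    """Manhattan distance: sum of absolute differences over all variables."""
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan(customers["Sue"], customers["Carl"]))  # 6 + 300 = 306
print(manhattan(customers["Tom"], customers["Jack"]))  # 7 + 600 = 607
```

Spending contributes far more than age to each total, since the variables are on very different scales.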

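11 Dissimilarity For two objects X and Y with continuous variables 1, 2, …, n, Euclidean distance is defined as: $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where $x_i$ and $y_i$ are the values of variable i for X and Y.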

12 Dissimilarity Example of Euclidean distance

NAME  AGE  SPENDING($)
Sue   21   23200
Carl  27   23330
Tom   45   23260
Jack  52   23400
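A short sketch of the Euclidean calculation for this example follows. The values are taken from the table as best it can be read from the flattened slide text (age, then spending), so treat them as an assumption; the names `customers` and `euclidean` are illustrative.

```python
import math

# Age and spending values as read from the table on this slide (assumed split).
customers = {
    "Sue": (21, 23200),
    "Carl": (27, 23330),
    "Tom": (45, 23260),
    "Jack": (52, 23400),
}

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Sue vs. Carl: sqrt(6^2 + 130^2) = sqrt(16936)
print(round(euclidean(customers["Sue"], customers["Carl"]), 2))
```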

13 Similarity/Dissimilarity Binary variable Normalized Manhattan distance = number of unmatched variables / total number of variables

NAME  Married  Gender  Home Internet
Sue   Y        F       Y
Carl  Y        M       Y
Tom   N        M       N
Jack  N        M       N
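The normalized Manhattan distance for binary variables reduces to a mismatch ratio. A minimal sketch using the table on this slide (the names `people` and `normalized_mismatch` are illustrative):

```python
# (Married, Gender, Home Internet) for each person, as listed on the slide.
people = {
    "Sue": ("Y", "F", "Y"),
    "Carl": ("Y", "M", "Y"),
    "Tom": ("N", "M", "N"),
    "Jack": ("N", "M", "N"),
}

def normalized_mismatch(x, y):
    """Number of unmatched variables divided by the total number of variables."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

print(normalized_mismatch(people["Sue"], people["Carl"]))  # 1 of 3 variables differs
print(normalized_mismatch(people["Sue"], people["Tom"]))   # all 3 differ: 1.0
print(normalized_mismatch(people["Tom"], people["Jack"]))  # none differ: 0.0
```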

14 Similarity/Dissimilarity Nominal/ordinal variables

NAME   AGE  BALANCE($)  INCOME  EYES   GENDER
Karen  21   2300        high    Blue   F
Sue    21   2300        high    Blue   F
Carl   27   5400        high    Brown  M

We assign 0/1 based on exact-match criteria: –Same gender = 0, different gender = 1 –Same eye color = 0, different eye color = 1 We can also “rank” an attribute –income high = 3, medium = 2, low = 1 –E.g. distance (high, low) = 2

15 Distance Calculation

NAME  AGE  BALANCE($)  INCOME  EYES   GENDER
Sue   21   2300        high    Blue   F
Carl  27   5400        high    Brown  M

Manhattan Difference: 6 + 3100 + 0 + 1 + 1 = 3108 Euclidean Difference: $\sqrt{6^2 + 3100^2 + 0 + 1 + 1}$ Is there a problem?
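The calculation on this slide combines continuous, ordinal and nominal differences. The sketch below reproduces it under the coding rules from the previous slide (0/1 for exact match, income ranked high = 3, medium = 2, low = 1); the record layout and function names are illustrative assumptions.

```python
import math

INCOME_RANK = {"low": 1, "medium": 2, "high": 3}

sue = {"age": 21, "balance": 2300, "income": "high", "eyes": "Blue", "gender": "F"}
carl = {"age": 27, "balance": 5400, "income": "high", "eyes": "Brown", "gender": "M"}

def per_variable_differences(x, y):
    """Absolute difference per variable: numeric gap, rank gap, or 0/1 mismatch."""
    return [
        abs(x["age"] - y["age"]),
        abs(x["balance"] - y["balance"]),
        abs(INCOME_RANK[x["income"]] - INCOME_RANK[y["income"]]),
        int(x["eyes"] != y["eyes"]),
        int(x["gender"] != y["gender"]),
    ]

diffs = per_variable_differences(sue, carl)
manhattan = sum(diffs)
euclidean = math.sqrt(sum(d ** 2 for d in diffs))

print(diffs)      # [6, 3100, 0, 1, 1]
print(manhattan)  # 3108
print(round(diffs[1] / manhattan, 4))  # share of the total contributed by balance
```

The balance difference accounts for almost the whole total, which is the problem the slide's closing question points to.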

16 Normalization Normalization of dimension values: –In the previous example, “balance” is dominant –Set the minimum and maximum distance values for each dimension to be the same (e.g., 0 - 100)

NAME  AGE  BALANCE($)  INCOME  EYES   GENDER
Sue   21   2300        high    Blue   F
Carl  27   5400        high    Brown  M
Don   18   0           low     Black  M
Amy   62   16,543      low     Blue   F

Assume that age ranges from 0 to 100. Manhattan Difference (Sue, Carl): 6 + 100 * ((5400 - 2300) / 16543) + 0 + 100 + 100
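The normalized Manhattan difference for Sue and Carl can be evaluated as follows. This is a sketch under the slide's stated assumptions (every dimension scaled to 0 - 100, age range 0 - 100, balance range 0 - 16,543); the helper names are illustrative.

```python
SCALE = 100  # common maximum distance for every dimension, as on the slide

def scaled_numeric(a, b, lo, hi):
    """Difference of two numeric values, rescaled so the full range maps to SCALE."""
    return SCALE * abs(a - b) / (hi - lo)

def scaled_nominal(a, b):
    """Exact-match rule: 0 if equal, the full SCALE if different."""
    return 0 if a == b else SCALE

terms = [
    scaled_numeric(21, 27, 0, 100),         # age: 6
    scaled_numeric(2300, 5400, 0, 16543),   # balance: 100 * 3100 / 16543
    scaled_nominal("high", "high"),         # income: same level, so 0
    scaled_nominal("Blue", "Brown"),        # eyes: 100
    scaled_nominal("F", "M"),               # gender: 100
]
total = sum(terms)
print(round(total, 2))
```

After scaling, balance no longer dominates; the two nominal mismatches now carry most of the distance, which shows that the choice of scale is itself a modeling decision.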

17 Standardization Calculate the mean value Calculate mean absolute deviation Standardize each variable value as: Standardized value = (original value – mean value)/ mean absolute deviation
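The three steps can be sketched in a few lines. The example reuses the ages from the Manhattan distance example (21, 27, 45, 52) purely for illustration; the function name `standardize` is an assumption.

```python
def standardize(values):
    """Standardize values using the mean and the mean absolute deviation."""
    mean = sum(values) / len(values)
    mean_abs_dev = sum(abs(v - mean) for v in values) / len(values)
    standardized = [(v - mean) / mean_abs_dev for v in values]
    return mean, mean_abs_dev, standardized

ages = [21, 27, 45, 52]
mean, mad, z = standardize(ages)
print(mean)  # 36.25
print(mad)   # 12.25
print([round(v, 3) for v in z])
```

Note that this sketch divides by zero when all values are equal; a production version would need to handle that case.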

18 Hierarchical Algorithms Output: a tree of clusters where a parent node (cluster) consists of objects in its child nodes (clusters) Input: Objects and distance measure only. No need for a pre-specified number of clusters. Agglomerative hierarchical clustering: –Bottom-up –Leaf nodes are individual objects –Merge lower level clusters by optimizing a clustering criterion until the termination conditions are satisfied. –More popular

19 Hierarchical Algorithms Output: a tree of clusters where a parent node (cluster) consists of objects in its child nodes (clusters) Input: Objects and distance measure only. No need for a pre-specified number of clusters. Divisive hierarchical clustering: –Top-down –The root node corresponds to the whole set of the objects –Subdivides a cluster into smaller clusters by optimizing a clustering criterion until the termination conditions are met.

20 Clustering based on dissimilarity After calculating dissimilarity between objects, a dissimilarity matrix can be created with objects as indexes and dissimilarities between objects as elements. Distance between clusters –Min, Max, Mean and Average

21 Clustering based on dissimilarity

      Sue  Tom  Carl  Jack  Mary
Sue   0    6    8     2     7
Tom   6    0    1     5     3
Carl  8    1    0     10    9
Jack  2    5    10    0     4
Mary  7    3    9     4     0
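Given this matrix, the min, max and average distances between two clusters can be computed directly from the pairwise entries. The sketch below is illustrative (the function names and the two example clusters are assumptions). The mean-based measure is left out because it is usually defined on cluster centroids, which require the original attribute values rather than a dissimilarity matrix.

```python
names = ["Sue", "Tom", "Carl", "Jack", "Mary"]
matrix = [
    [0, 6, 8, 2, 7],
    [6, 0, 1, 5, 3],
    [8, 1, 0, 10, 9],
    [2, 5, 10, 0, 4],
    [7, 3, 9, 4, 0],
]
# Look-up table keyed by a pair of names.
D = {(a, b): matrix[i][j] for i, a in enumerate(names) for j, b in enumerate(names)}

def pairwise(c1, c2):
    """All dissimilarities between one object from each cluster."""
    return [D[a, b] for a in c1 for b in c2]

def min_distance(c1, c2):
    return min(pairwise(c1, c2))

def max_distance(c1, c2):
    return max(pairwise(c1, c2))

def average_distance(c1, c2):
    values = pairwise(c1, c2)
    return sum(values) / len(values)

a, b = ["Sue", "Jack"], ["Tom", "Carl"]   # example clusters, chosen for illustration
print(min_distance(a, b))      # 5
print(max_distance(a, b))      # 10
print(average_distance(a, b))  # 7.25
```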

22 Bottom-up Hierarchical Clustering
Step 1: Initially, place each object in its own cluster.
Step 2: Calculate the dissimilarity between clusters. The dissimilarity between two clusters is the minimum dissimilarity between two objects, one from each cluster.
Step 3: Merge the two clusters with the least dissimilarity.
Step 4: Repeat steps 2-3 until all objects are in one cluster.
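The steps above can be sketched as a short loop. This is a minimal illustration applied to the dissimilarity matrix from slide 21, not an optimized implementation; the function names are assumptions.

```python
names = ["Sue", "Tom", "Carl", "Jack", "Mary"]
matrix = [
    [0, 6, 8, 2, 7],
    [6, 0, 1, 5, 3],
    [8, 1, 0, 10, 9],
    [2, 5, 10, 0, 4],
    [7, 3, 9, 4, 0],
]
D = {(a, b): matrix[i][j] for i, a in enumerate(names) for j, b in enumerate(names)}

def min_dissimilarity(c1, c2):
    """Step 2: minimum dissimilarity between two objects, one from each cluster."""
    return min(D[a, b] for a in c1 for b in c2)

def agglomerate(objects):
    clusters = [frozenset([o]) for o in objects]   # Step 1: one cluster per object
    merges = []
    while len(clusters) > 1:                       # Step 4: repeat until one cluster
        candidates = [
            (min_dissimilarity(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        ]
        d, i, j = min(candidates)                  # Step 3: least dissimilar pair
        merges.append((d, sorted(clusters[i]), sorted(clusters[j])))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = agglomerate(names)
for d, left, right in merges:
    print(d, left, "+", right)
```

On this matrix the merges happen at dissimilarities 1 (Tom, Carl), 2 (Sue, Jack), 3 (Mary joins Tom and Carl) and 4 (the two remaining clusters). The recorded merge sequence is the tree of clusters described on the hierarchical algorithm slides.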

23 Nearest Neighbor Clustering (Demographic Clustering) Dissimilarity by votes Merge an object into a cluster with the lowest avg dissimilarity If the avg dissimilarity with each cluster exceeds a threshold, the object forms its own cluster Stop after a max # of passes, a max # of clusters or no significant changes in the avg dissimilarities in each cluster
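A single pass of this assignment rule can be sketched as below. The sketch makes several assumptions that the slide does not specify: it uses the dissimilarity matrix from slide 21 in place of vote-based dissimilarity, a hypothetical threshold of 4, and only one pass over the objects in the listed order. The function name `assign_once` is illustrative.

```python
names = ["Sue", "Tom", "Carl", "Jack", "Mary"]
matrix = [
    [0, 6, 8, 2, 7],
    [6, 0, 1, 5, 3],
    [8, 1, 0, 10, 9],
    [2, 5, 10, 0, 4],
    [7, 3, 9, 4, 0],
]
D = {(a, b): matrix[i][j] for i, a in enumerate(names) for j, b in enumerate(names)}

def assign_once(objects, threshold):
    """One pass: join the cluster with the lowest average dissimilarity,
    or start a new cluster if that average exceeds the threshold."""
    clusters = []
    for obj in objects:
        best, best_avg = None, None
        for cluster in clusters:
            avg = sum(D[obj, member] for member in cluster) / len(cluster)
            if best_avg is None or avg < best_avg:
                best, best_avg = cluster, avg
        if best is None or best_avg > threshold:
            clusters.append([obj])   # no cluster is close enough
        else:
            best.append(obj)
    return clusters

result = assign_once(names, threshold=4)   # threshold value is hypothetical
print(result)  # [['Sue', 'Jack'], ['Tom', 'Carl'], ['Mary']]
```

Because objects are assigned as they arrive, a different input order can give different clusters, which is the order sensitivity listed among the comparative criteria.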

24 Comparative Criteria for Clustering Algorithms Performance Scalability Ability to deal with different attribute types Clusters with arbitrary shape Need K or not Noise handling Sensitivity to the order of input records High dimensionality (# of attributes) Constraint-based clustering Interpretability and usability

25 Summary Problem definition –Input: objects without class labels –Output: clusters for discovery and conceptual clustering for prediction Similarity/dissimilarity measures and calculations Hierarchical Clustering Criteria for comparing algorithms Readings – T2, pp. 335 – 344 and 354-356
