MS Clustering Chapters15_to_17_Part5


1 MS Clustering Chapters15_to_17_Part5

2 What is it?
- Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure.
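To make "proximity according to some defined distance measure" concrete, here is a minimal sketch that assigns each point to its nearest cluster center. It assumes Euclidean distance and two made-up centers; the slides do not say which distance measure is used, and the names `euclidean` and `nearest_center` are illustrative only.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))

# Hypothetical data: two cluster centers and three cases.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.5)]

labels = [nearest_center(p, centers) for p in points]
print(labels)  # [0, 1, 0]
```

Swapping `euclidean` for another distance function changes which cases count as "close", and therefore which clusters form.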

3 We have been doing it
- We have been grouping people, cars, etc.
- We are just not very good at it when there are too many items to keep track of
- Experts can track five to six dimensions; a data set may have many times that
- Most likely, we can only see the obvious groups
- It is difficult for us to see the hidden ones, or the combined ones

4 An Example
- You can group your customers (for a bike store) into several groups based on
  - Gender
  - Income
  - Age
  - Etc.
- There may be other things, such as: do they play games?

5 Principles of Clustering
- Guessing and lying (MS)
  - Setting clusters
- Training with data
- Calibrating your clusters
- Training again
- Repeating until converged or going nowhere
- The clustering methodology is very sensitive to the starting points and can converge at a local solution that may not be the globally optimal solution
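The loop above (set clusters, train, calibrate, repeat until converged) and its sensitivity to starting points can be sketched with plain K-means. This is an illustration under assumptions, not the Microsoft implementation: the four data points and both sets of starting centers are made up, and "converged" here means that no case switched clusters in a pass.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, start_centers, max_iter=100):
    centers = list(start_centers)          # the initial guess
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Train: assign every case to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points
        ]
        switched = sum(1 for old, new in zip(labels, new_labels) if old != new)
        labels = new_labels
        if switched == 0:                  # converged: nobody moved
            break
        # Calibrate: move each center to the mean of its members.
        for i in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse

# Four corners of a wide rectangle; the natural split is left vs. right.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

_, good_labels, good_sse = kmeans(points, [(0.0, 0.5), (4.0, 0.5)])
_, bad_labels, bad_sse = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])

print(good_labels, good_sse)  # [0, 0, 1, 1] 1.0
print(bad_labels, bad_sse)    # [0, 1, 0, 1] 16.0
```

Both runs converge, but the second start settles on a bottom-versus-top split with a much larger total squared error: a local solution that is not the global one.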

6 Soft and hard clustering
- Hard: one case, one cluster
- Soft: one case, several clusters
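The difference between the two can be sketched as follows. The soft version returns a membership weight for every cluster; the hard version keeps only the strongest one. The Gaussian-style weighting with unit variance is an assumption chosen for illustration, and the function names are hypothetical.

```python
import math

def soft_membership(point, centers):
    """Membership weight of the point in each cluster; the weights sum to 1."""
    weights = [
        math.exp(-0.5 * sum((x - c) ** 2 for x, c in zip(point, center)))
        for center in centers
    ]
    total = sum(weights)
    return [w / total for w in weights]

def hard_membership(point, centers):
    """Index of the single cluster with the highest membership weight."""
    probs = soft_membership(point, centers)
    return probs.index(max(probs))

# Hypothetical one-dimensional centers.
centers = [(0.0,), (2.0,)]

halfway = soft_membership((1.0,), centers)   # equally close to both centers
near_first = soft_membership((0.0,), centers)

print(halfway)                               # both weights are 0.5
print(hard_membership((0.0,), centers))      # 0
```

A case halfway between two centers belongs equally to both under soft clustering, while hard clustering has to pick one of them.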

7 Scalable clustering
- Ideally, data points that will not change clusters do not need to be considered again
- MS's implementation reads the first 50,000 cases. If the model does not converge, it processes the next 50,000, rather than reading in and processing all 100,000 at once.
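The chunk-at-a-time idea can be sketched roughly as below. This is a simplified illustration, not Microsoft's algorithm: it works on one-dimensional data, keeps only running sums and counts per cluster instead of the cases themselves, and treats "converged" as the centers moving less than a tolerance between chunks. The names `scalable_kmeans_1d`, `sample_size`, and `tolerance` are assumptions made for this example.

```python
def chunks(cases, size):
    """Yield consecutive slices of at most `size` cases."""
    for start in range(0, len(cases), size):
        yield cases[start:start + size]

def scalable_kmeans_1d(cases, start_centers, sample_size, tolerance=1e-3):
    centers = list(start_centers)
    sums = [0.0] * len(centers)    # compressed summary of cases seen so far
    counts = [0] * len(centers)
    chunks_read = 0
    for chunk in chunks(cases, sample_size):
        chunks_read += 1
        for x in chunk:
            i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            sums[i] += x
            counts[i] += 1
        new_centers = [
            sums[i] / counts[i] if counts[i] else centers[i]
            for i in range(len(centers))
        ]
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tolerance:      # converged: no need to read further chunks
            break
    return centers, chunks_read

# 100 hypothetical cases, read 10 at a time.
cases = [0.0, 10.0] * 50
centers, chunks_read = scalable_kmeans_1d(cases, [1.0, 9.0], sample_size=10)

print(centers, chunks_read)  # [0.0, 10.0] 2
```

Here the model stabilizes after the second chunk, so the remaining eight chunks are never read.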

8 A few interesting parameters
- Clustering_Method
  - Which method to use, 1 to 4 (scalable or non-scalable EM, scalable or non-scalable K-means)
- Cluster_Count
  - The number of clusters to find
  - 0 makes the algorithm guess a good number
- Minimum_Support
  - The case count below which a cluster is considered empty
- Stopping_Tolerance
  - The number of cases switching clusters below which the model is considered converged
- Sample_Size
  - For scalable clustering
- Cluster_Seed
  - Seed for the random initial placement of the clusters
- Maximum_Input_Attributes
  - The number of attributes considered before automatic feature selection kicks in. Automatic feature selection selects the most popular attributes.
- Maximum_States
  - The maximum number of possible values per attribute

9 Understanding The Results
- Comprehending the results can be difficult because you have to look in many directions
  - High-level overview
  - Look into a cluster
  - Determine how a cluster is different from a nearby one

10 High-level overview
- Cluster Profiles view (too much info)
  - Get some sense of who/what is in each cluster

11 High-level overview
- Cluster Diagram view
  - Get some sense of the relationships among clusters

12 Look into a cluster
- The Cluster characteristic view
  - See the attributes that go together
  - Note that an attribute may rank high simply because it ranks high in every cluster. In that case, it is not that interesting.
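The caveat about attributes that rank high everywhere can be checked by comparing an attribute's frequency inside a cluster with its frequency across all cases. The sketch below uses made-up bike-store cases and a simple ratio (often called lift); this is one way to illustrate the point, not necessarily how the viewer computes its ranking.

```python
from collections import Counter

# Hypothetical cases: (cluster id, attribute values present in the case).
cases = [
    (0, {"owns_bike", "male"}),
    (0, {"owns_bike", "male"}),
    (0, {"owns_bike", "female"}),
    (1, {"owns_bike", "female"}),
    (1, {"owns_bike", "female"}),
    (1, {"owns_bike", "male"}),
]

def frequencies(rows):
    """Fraction of rows in which each attribute value appears."""
    counts = Counter(attr for _, attrs in rows for attr in attrs)
    return {attr: n / len(rows) for attr, n in counts.items()}

overall = frequencies(cases)
in_cluster = frequencies([row for row in cases if row[0] == 0])

# Ratio above 1: more common in this cluster than in the data as a whole.
lift = {attr: in_cluster[attr] / overall[attr] for attr in in_cluster}

top_by_frequency = max(in_cluster, key=in_cluster.get)
top_by_lift = max(lift, key=lift.get)
print(top_by_frequency, top_by_lift)  # owns_bike male
```

`owns_bike` tops the frequency ranking for cluster 0 only because every customer owns a bike; its ratio is exactly 1, so it says nothing specific about the cluster, whereas `male` does.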

13 Cluster characteristic view

14 Look outside a cluster
- Discrimination and Complement
  - Show you which attributes are important

