1 DATA MINING WITH CLUSTERING AND CLASSIFICATION Spring 2007, SJSU Benjamin Lam

2 Overview
- Definition of clustering
- Existing clustering methods
- Clustering examples
- Classification
- Classification examples
- Conclusion

3 Definition Clustering can be considered the most important unsupervised learning technique; like every problem of this kind, it deals with finding structure in a collection of unlabeled data. Clustering is "the process of organizing objects into groups whose members are similar in some way". A cluster is therefore a collection of objects that are "similar" to one another and "dissimilar" to the objects belonging to other clusters. Mu-Yu Lu, SJSU

4

5 Why clustering? A few good reasons:
- Simplification
- Pattern detection
- Useful in data concept construction
- Unsupervised learning process

6 Where to use clustering?
- Data mining
- Information retrieval
- Text mining
- Web analysis
- Marketing
- Medical diagnostics

7 Which method should I use? Consider:
- Type of attributes in the data
- Scalability to larger datasets
- Ability to work with irregular data
- Time complexity
- Data order dependency
- Result presentation

8 Major existing clustering methods
- Distance-based
- Hierarchical
- Partitioning
- Probabilistic

9 Measuring similarity
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric d(i, j).
- A separate "quality" function measures the "goodness" of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables.
- Weights should be associated with different variables based on the application and data semantics.
- It is hard to define "similar enough" or "good enough" – the answer is typically highly subjective.
Professor Lee, Sin-Min
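The point about distance functions differing by variable type can be made concrete. A minimal sketch (in Python, which is an assumption; the slides name no language): Euclidean distance for interval-scaled attributes, and a simple matching (Hamming-style) dissimilarity for boolean or categorical ones.

```python
import math

def euclidean(a, b):
    # d(i, j) for interval-scaled (numeric) attributes.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def matching_dissimilarity(a, b):
    # d(i, j) for boolean/categorical attributes: the fraction
    # of attributes on which the two objects differ.
    return sum(x != y for x, y in zip(a, b)) / len(a)

print(euclidean((0, 0), (3, 4)))                     # 5.0
print(matching_dissimilarity(("red", "S"), ("red", "M")))  # 0.5
```

Per-variable weights, as the slide suggests, would simply multiply each term in the sums above.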

10 Distance-based method In this case we can easily identify four clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are "close" according to a given distance. This is called distance-based clustering.
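The "close according to a given distance" criterion can be sketched as assigning each point to its nearest cluster center (an illustrative Python fragment; the centers here are made-up example values).

```python
import math

def nearest_center(point, centers):
    # Assign a point to the cluster whose center is closest
    # under Euclidean distance.
    def d(p, c):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))
    return min(range(len(centers)), key=lambda k: d(point, centers[k]))

centers = [(0, 0), (10, 10)]
print(nearest_center((1, 2), centers))   # 0 -> closest to (0, 0)
print(nearest_center((9, 8), centers))   # 1 -> closest to (10, 10)
```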

11 Hierarchical clustering
Agglomerative (bottom-up)
1. Start with each point as its own cluster (a singleton).
2. Recursively merge the closest (most appropriate) clusters.
3. Stop when k clusters remain.
Divisive (top-down)
1. Start with one big cluster containing all points.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters remain.

12 General steps of hierarchical clustering
Given a set of N items to be clustered and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until the items are grouped into K clusters.
Mu-Yu Lu, SJSU
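The steps above can be sketched as a toy agglomerative routine (Python, single-linkage; both the language and the linkage choice are illustrative assumptions, since the slide does not fix how the cluster-to-cluster distance is recomputed).

```python
def single_linkage(dist, k):
    # dist: N*N symmetric distance matrix; stop when k clusters remain.
    clusters = [[i] for i in range(len(dist))]   # step 1: one item per cluster
    while len(clusters) > k:
        # step 2: find the closest pair of clusters (single link:
        # cluster distance = min distance between their members)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge: one cluster fewer
        del clusters[j]                          # steps 2-3 repeat in the loop
    return clusters

dist = [[0, 1, 9, 8],
        [1, 0, 8, 9],
        [9, 8, 0, 2],
        [8, 9, 2, 0]]
print(single_linkage(dist, 2))   # [[0, 1], [2, 3]]
```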

13 Exclusive vs. non-exclusive clustering In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster it cannot be included in another cluster. A simple example is separating points with a straight line on a two-dimensional plane. The second type, overlapping clustering, instead uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership.
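Degrees of membership can be illustrated with a simple inverse-distance scheme (a Python sketch; this is a toy scheme for intuition, not the full fuzzy c-means algorithm).

```python
def fuzzy_memberships(point, centers):
    # Overlapping clustering: each point gets a degree of membership
    # in every cluster; the degrees sum to 1.
    dists = [abs(point - c) for c in centers]
    if 0.0 in dists:                 # point coincides with a center
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    inv = [1.0 / d for d in dists]   # closer center -> larger weight
    total = sum(inv)
    return [w / total for w in inv]

# A point at 4.0 belongs mostly, but not only, to the cluster at 0.0.
print(fuzzy_memberships(4.0, [0.0, 10.0]))
```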

14 Partitioning clustering
1. Divide the data into a set of subsets.
2. Recursively visit each subset and relocate points between clusters (as opposed to the visit-once approach of hierarchical clustering).
This recursive relocation yields higher-quality clusters.
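A classic instance of this relocation idea is k-means. A minimal 1-D sketch in Python (the data values and iteration count are illustrative assumptions):

```python
import random

def kmeans(points, k, iters=10, seed=0):
    # Partitioning: pick initial centers, then repeatedly relocate
    # points between clusters until the assignment settles.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assign each point to its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            groups[i].append(p)
        # relocate each center to the mean of its group
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sorted(centers)

data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
print(kmeans(data, 2))   # centers near 1.0 and 10.0
```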

15 Probabilistic clustering
1. Data are assumed to be drawn from a mixture of probability distributions.
2. The mean and variance of each distribution serve as the cluster's parameters.
3. Each point receives a single cluster membership.
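Single cluster membership under a mixture model means assigning each point to the component under which it is most likely. A sketch with two 1-D Gaussians (Python; the mixture parameters are made-up example values):

```python
import math

def gauss_pdf(x, mean, var):
    # Density of a 1-D normal distribution with the given mean/variance.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def assign(x, components):
    # Single membership: pick the component whose density at x is highest.
    # components = [(mean, variance), ...]
    return max(range(len(components)),
               key=lambda k: gauss_pdf(x, *components[k]))

mixture = [(0.0, 1.0), (5.0, 1.0)]
print(assign(1.0, mixture))   # 0 -> more likely under the N(0, 1) component
print(assign(4.0, mixture))   # 1 -> more likely under the N(5, 1) component
```

In practice the means and variances are themselves estimated from the data (e.g. by expectation-maximization); here they are given.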

16 Classification examples
- Teachers classify students' grades as A, B, C, D, or F.
- Identify mushrooms as poisonous or edible.
- Predict when a river will flood.
- Identify individuals who are credit risks.
- Speech recognition
- Pattern recognition

17 Classification
Classification problem overview
Classification techniques:
- Regression
- Distance
- Decision trees
- Rules
- Neural networks
Goal: provide an overview of the classification problem and introduce some of the basic algorithms.

18 Classification example: grading
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.
(Decision tree: test x against the thresholds 90, 80, 70, and 60 in turn, branching to A, B, C, D, or F.)
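The grading rules translate directly into code. A sketch in Python (illustrative; it assumes the F branch covers all scores below 60, closing the gap between 50 and 60 that the rules as listed on the slide would leave unlabeled):

```python
def grade(x):
    # Predefined classes applied as a sequence of threshold tests,
    # mirroring the decision tree on the slide.
    if x >= 90: return "A"
    if x >= 80: return "B"
    if x >= 70: return "C"
    if x >= 60: return "D"
    return "F"

print([grade(x) for x in (95, 85, 75, 65, 50)])   # ['A', 'B', 'C', 'D', 'F']
```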

19 Classification techniques
Approach:
1. Create a specific model by evaluating training data (or using domain experts' knowledge).
2. Apply the developed model to new data.
The classes must be predefined. The most common techniques use decision trees (DTs), neural networks (NNs), or are based on distances or statistical methods.

20 Defining classes
- Partitioning-based
- Distance-based

