
1 Data Mining – Clustering and Classification

2
• Review Questions
  ◦ Question 1: Clustering and Classification
• Algorithm Questions
  ◦ Question 2: K-Means Clustering
  ◦ Question 3: Classification Tree

3

4 What is Clustering?

5 Clustering is finding groups of objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Clustering can be used to understand data (e.g. grouping related documents for browsing, grouping genes and proteins that have similar functionality, or grouping stocks with similar price fluctuations) or to summarise data (i.e. reduce the size of a large data set, for example precipitation records for Australia).

6 What is Classification?

7 Classification is where, given a collection of records (known as a training set), we find a model for the class attribute as a function of the values of the other attributes. The goal is that previously unseen records should be assigned a class as accurately as possible; a test set is used to measure the accuracy of the model. Classification examples include predicting whether tumor cells are benign or malignant, classifying credit card transactions as legitimate or fraudulent, classifying protein structures as alpha-helix, beta-sheet, or random coil, and categorising news stories as finance, weather, entertainment, sports, etc.
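As an illustration (not part of the original slides), the minimal Python sketch below shows the training-set / test-set workflow described above; scikit-learn, its bundled iris data and the 70/30 split are assumptions made for the example only.

    # Illustrative only: the library, data set and split size are my choices, not the slides'.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    # Hold out 30% of the records as a test set of "previously unseen" records.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # learn from the training set
    print(accuracy_score(y_test, model.predict(X_test)))                  # accuracy on unseen records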

8 How do the two differ?

9 The difference between the two approaches is that clustering is used to find groups of similar or related records in a data set, whereas classification builds a model that predicts the class attribute as a function of the values of the other attributes. Clustering returns groups of records, while classification returns a class label for each record in the data set.

10

11 Consider the following set of two-dimensional records. Use the k-means algorithm to cluster the data. We assume there are 3 clusters, and that records 1, 3 and 5 are used as the initial centroids (means).

RID   Dimension1   Dimension2
1     8            4
2     5            4
3     2            4
4     2            6
5     2            8
6     8            6

12 Calculate the distance from each record to the three initial centroids (record 1 is the centroid of C1, record 3 the centroid of C2, and record 5 the centroid of C3):

RID   Distance to 1 (8, 4)              Distance to 3 (2, 4)              Distance to 5 (2, 8)
1     0                                 –                                 –
2     sqrt((8-5)^2 + (4-4)^2) = 3       sqrt((2-5)^2 + (4-4)^2) = 3       sqrt((2-5)^2 + (8-4)^2) = 5
3     –                                 0                                 –
4     sqrt((8-2)^2 + (4-6)^2) = 6.3     sqrt((2-2)^2 + (4-6)^2) = 2       sqrt((2-2)^2 + (8-6)^2) = 2
5     –                                 –                                 0
6     sqrt((8-8)^2 + (4-6)^2) = 2       sqrt((2-8)^2 + (4-6)^2) = 6.3     sqrt((2-8)^2 + (8-6)^2) = 6.3
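As an illustration (not from the slides), a minimal plain-Python sketch of this distance step is given below. The records and initial centroids are taken from the slides; the helper name dist is an assumed name for the example.

    # Reproduce the distance table above with plain Python (no libraries).
    from math import sqrt

    records = {                      # RID -> (Dimension1, Dimension2)
        1: (8, 4), 2: (5, 4), 3: (2, 4),
        4: (2, 6), 5: (2, 8), 6: (8, 6),
    }
    # Initial centroids: records 1, 3 and 5.
    centroids = {"C1": records[1], "C2": records[3], "C3": records[5]}

    def dist(p, q):
        """Euclidean distance between two 2-D points."""
        return sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

    for rid, point in records.items():
        print(rid, [round(dist(point, c), 1) for c in centroids.values()])
    # e.g. RID 2 -> [3.0, 3.0, 5.0] and RID 4 -> [6.3, 2.0, 2.0], matching the table
    # (the code also prints the distances the slide leaves blank for the centroid rows).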

13 Assigning each record to the cluster whose centroid is closest (record 2 is equidistant from C1 and C2 and is placed in C1) gives the following clusters:
C1: 1, 2, 6
C2: 3, 4
C3: 5
The means of the clusters are now:
C1: D1 = (8 + 5 + 8)/3 = 7, D2 = (4 + 4 + 6)/3 = 4.7
C2: D1 = (2 + 2)/2 = 2, D2 = (4 + 6)/2 = 5
C3: D1 = 2, D2 = 8
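Continuing the sketch above (an illustration, not the slides' own code), this step can be written as: assign every record to its nearest centroid, then recompute each cluster mean. It reuses records, centroids and dist from the previous sketch.

    from collections import defaultdict

    assignments = {rid: min(centroids, key=lambda c: dist(p, centroids[c]))
                   for rid, p in records.items()}
    # Record 2 is equidistant from C1 and C2; min() keeps the first minimum it
    # sees, so the tie goes to C1, as on the slide.

    clusters = defaultdict(list)
    for rid, c in assignments.items():
        clusters[c].append(records[rid])

    new_centroids = {c: tuple(sum(dim) / len(pts) for dim in zip(*pts))
                     for c, pts in clusters.items()}
    print(new_centroids)   # C1 ~ (7, 4.67), C2 = (2, 5), C3 = (2, 8)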

14 Now calculate the distance from each record to the new cluster centroids:

RID   Distance to C1 (7, 4.7)             Distance to C2 (2, 5)             Distance to C3 (2, 8)
1     sqrt((7-8)^2 + (4.7-4)^2) = 1.2     sqrt((2-8)^2 + (5-4)^2) = 6.1     sqrt((2-8)^2 + (8-4)^2) = 7.2
2     sqrt((7-5)^2 + (4.7-4)^2) = 2.1     sqrt((2-5)^2 + (5-4)^2) = 3.2     sqrt((2-5)^2 + (8-4)^2) = 5
3     sqrt((7-2)^2 + (4.7-4)^2) = 5       sqrt((2-2)^2 + (5-4)^2) = 1       sqrt((2-2)^2 + (8-4)^2) = 4
4     sqrt((7-2)^2 + (4.7-6)^2) = 5.17    sqrt((2-2)^2 + (5-6)^2) = 1       sqrt((2-2)^2 + (8-6)^2) = 2
5     sqrt((7-2)^2 + (4.7-8)^2) = 6       sqrt((2-2)^2 + (5-8)^2) = 3       sqrt((2-2)^2 + (8-8)^2) = 0
6     sqrt((7-8)^2 + (4.7-6)^2) = 1.6     sqrt((2-8)^2 + (5-6)^2) = 6.1     sqrt((2-8)^2 + (8-6)^2) = 6.3

15 Assigning each record to the cluster whose centroid is closest again gives:
C1: 1, 2, 6
C2: 3, 4
C3: 5
Since no record has changed cluster, the algorithm stops and these are the final clusters.
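As a final illustration (again not from the slides), the whole procedure can be written as a compact k-means loop that repeats the two steps above until the assignments stop changing, which is the stopping rule used on this slide. It reuses records, centroids and dist from the earlier sketches and assumes no cluster ever becomes empty, which holds for this data.

    assignments = None
    while True:
        # Step 1: assign each record to its nearest centroid.
        new_assignments = {rid: min(centroids, key=lambda c: dist(p, centroids[c]))
                           for rid, p in records.items()}
        if new_assignments == assignments:   # no record changed cluster: stop
            break
        assignments = new_assignments
        # Step 2: recompute each cluster's mean from its members.
        for c in centroids:
            pts = [records[r] for r, cl in assignments.items() if cl == c]
            centroids[c] = tuple(sum(d) / len(pts) for d in zip(*pts))

    print(assignments)   # {1: 'C1', 2: 'C1', 3: 'C2', 4: 'C2', 5: 'C3', 6: 'C1'}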

16 Apply Hunt's classification algorithm to build a decision tree for the following training data set (Loanworthy is the class attribute, RID is the record ID), assuming the attributes are tested from left to right.

RID   Married   Salary    Acct_Balance   Age    Loanworthy
1     no        >=50K     <5K            >=25   yes
2               >=50K     >=5K           >=25   yes
3               20K…50K   <5K            <25    no
4               <20K      >=5K           <25    no
5               <20K      <5K            >=25   no
6     yes       20K…50K   >=5K           >=25   yes
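As a hedged sketch (not the slides' answer), the Python below implements a simplified Hunt-style tree builder that tests attributes strictly left to right: a node that is pure returns its class, otherwise it splits on the next attribute. Because the Married values for records 2–5 are not readable in this transcript, the demo call uses only Salary, Acct_Balance and Age; the function itself accepts any attribute list.

    def hunt(records, attributes, class_attr="Loanworthy"):
        """Simplified Hunt's algorithm: split on attributes in the given order."""
        classes = {r[class_attr] for r in records}
        if len(classes) == 1:                       # pure node: return the class label
            return classes.pop()
        if not attributes:                          # no attribute left: majority class
            labels = [r[class_attr] for r in records]
            return max(set(labels), key=labels.count)
        attr, rest = attributes[0], attributes[1:]  # test attributes left to right
        tree = {attr: {}}
        for value in {r[attr] for r in records}:    # one branch per attribute value
            subset = [r for r in records if r[attr] == value]
            tree[attr][value] = hunt(subset, rest, class_attr)
        return tree

    training = [   # rows from the slide; Married omitted (unreadable for RIDs 2-5)
        {"Salary": ">=50K",   "Acct_Balance": "<5K",  "Age": ">=25", "Loanworthy": "yes"},
        {"Salary": ">=50K",   "Acct_Balance": ">=5K", "Age": ">=25", "Loanworthy": "yes"},
        {"Salary": "20K…50K", "Acct_Balance": "<5K",  "Age": "<25",  "Loanworthy": "no"},
        {"Salary": "<20K",    "Acct_Balance": ">=5K", "Age": "<25",  "Loanworthy": "no"},
        {"Salary": "<20K",    "Acct_Balance": "<5K",  "Age": ">=25", "Loanworthy": "no"},
        {"Salary": "20K…50K", "Acct_Balance": ">=5K", "Age": ">=25", "Loanworthy": "yes"},
    ]
    print(hunt(training, ["Salary", "Acct_Balance", "Age"]))
    # e.g. {'Salary': {'>=50K': 'yes', '<20K': 'no',
    #                  '20K…50K': {'Acct_Balance': {'<5K': 'no', '>=5K': 'yes'}}}}
    # (branch order may vary, since branch values come from a set)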

17

18

