# 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 6a, February 25, 2014, SAGE 3101 kNN, K-Means, Clustering and Bayesian Inference.

## Presentation on theme: "1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 6a, February 25, 2014, SAGE 3101 kNN, K-Means, Clustering and Bayesian Inference."— Presentation transcript:

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 6a, February 25, 2014, SAGE 3101 kNN, K-Means, Clustering and Bayesian Inference

Contents 2

Did you get to create the neighborhood map? table(mapcoord\$NEIGHBORHOOD) mapcoord\$NEIGHBORHOOD <- as.factor(mapcoord\$NEIGHBORHOOD) geoPlot(mapcoord,zoom=12,color=mapcoord\$NEIGH BORHOOD) # this one is easier 3

4

KNN! Did you loop over k? { knnpred<- knn(mapcoord[trainid,3:4],mapcoord[testid,3:4],cl=ma pcoord[trainid,2],k=5) knntesterr<- sum(knnpred!=mappred\$class)/length(testid) } knntesterr [1] 0.1028037 0.1308411 0.1308411 0.1588785 0.1401869 0.1495327 0.1682243 0.1962617 0.1962617 0.1869159 What do you think? 5

What else could you classify? SALE.PRICE? –If so, how would you measure error? # I added SALE.PRICE as 5 th column in adduse… > pcolor<- color.scale(log(mapcoord[,5]),c(0,1,1),c(1,1,0),0 ) > geoPlot(mapcoord,zoom=12,color=pcolor) TAX.CLASS.AT.PRESENT? TAX.CLASS.AT.TIME.OF.SALE? measure error? 6

Summing up ‘knn’ Advantages –Robust to noisy training data (especially if we use inverse square of weighted distance as the “distance”) –Effective if the training data is large Disadvantages –Need to determine value of parameter K (number of nearest neighbors) –Distance based learning is not clear which type of distance to use and which attribute to use to produce the best results. Shall we use all attributes or certain attributes only? Friday – yet more KNN: weighted KNN… 7

Return object clusterA vector of integers (from 1:k) indicating the cluster to which each point is allocated. centersA matrix of cluster centres. totssThe total sum of squares. withinssVector of within-cluster sum of squares, one component per cluster. tot.withinssTotal within-cluster sum of squares, i.e., sum(withinss). betweenssThe between-cluster sum of squares, i.e. totss-tot.withinss. sizeThe number of points in each cluster. 9

10 > plot(mapmeans,mapobj\$clu ster) ZIP.CODE, NEIGHBORHOOD, TOTAL.UNITS, LAND.SQUARE.FEET, GROSS.SQUARE.FEET, SALE.PRICE, latitude, longitude' ZIP.CODE, NEIGHBORHOOD, TOTAL.UNITS, LAND.SF, GROSS.SF, SALE.PRICE, lat, long > mapobj\$size [1] 432 31 1 11 56

11 > mapobj\$centers adduse.ZIP.CODE as.numeric.mapcoord.NEIGHBORHOOD. adduse.TOTAL.UNITS adduse.LAND.SQUARE.FEET 1 10464.09 19.47454 1.550926 2028.285 2 10460.65 16.38710 25.419355 11077.419 3 10454.00 20.00000 1.000000 29000.000 4 10463.45 10.90909 42.181818 10462.273 5 10464.00 17.42857 4.714286 14042.214 adduse.GROSS.SQUARE.FEET adduse.SALE.PRICE adduse..querylist.latitude. adduse..querylist.longitude. 1 1712.887 279950.4 40.85280 -73.87357 2 26793.516 2944099.9 40.85597 -73.89139 3 87000.000 24120881.0 40.80441 -73.92290 4 40476.636 6953345.4 40.86009 -73.88632 5 9757.679 885950.9 40.85300 -73.87781

Plotting clusters require(cluster) clusplot(mapmeans, mapobj\$cluster, color=TRUE, shade=TRUE, labels=2, lines=0) 12

Plot 14

Clusplot (k=17) 15

Dendogram for this = tree of the clusters: 16 Highly supported by data? Okay, this is a little complex – perhaps something simpler?

Hierarchical clustering > d <- dist(as.matrix(mtcars)) > hc <- hclust(d) > plot(hc) 17

Decision tree (example) > require(party) # don’t get me started! > str(iris) 'data.frame':150 obs. of 5 variables: \$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9... \$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1... \$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5... \$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1... \$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1... > iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris) 18

> print(iris_ctree) Conditional inference tree with 4 terminal nodes Response: Species Inputs: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width Number of observations: 150 1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264 2)* weights = 50 1) Petal.Length > 1.9 3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894 4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865 5)* weights = 46 4) Petal.Length > 4.8 6)* weights = 8 3) Petal.Width > 1.7 7)* weights = 46 19

plot(iris_ctree) 20

However… there is more 21

Bayes > cl <- kmeans(iris[,1:4], 3) > table(cl\$cluster, iris[,5]) setosa versicolor virginica 2 0 2 36 1 0 48 14 3 50 0 0 # > m <- naiveBayes(iris[,1:4], iris[,5]) > table(predict(m, iris[,1:4]), iris[,5]) setosa versicolor virginica setosa 50 0 0 versicolor 0 47 3 virginica 0 3 47 22

Using a contingency table > data(Titanic) > mdl <- naiveBayes(Survived ~., data = Titanic) > mdl 23 Naive Bayes Classifier for Discrete Predictors Call: naiveBayes.formula(formula = Survived ~., data = Titanic) A-priori probabilities: Survived No Yes 0.676965 0.323035 Conditional probabilities: Class Survived 1st 2nd 3rd Crew No 0.08187919 0.11208054 0.35436242 0.45167785 Yes 0.28551336 0.16596343 0.25035162 0.29817159 Sex Survived Male Female No 0.91543624 0.08456376 Yes 0.51617440 0.48382560 Age Survived Child Adult No 0.03489933 0.96510067 Yes 0.08016878 0.91983122

Using a contingency table > predict(mdl, as.data.frame(Titanic)[,1:3]) [1] Yes No No No Yes Yes Yes Yes No No No No Yes Yes Yes Yes Yes No No No Yes Yes Yes Yes No [26] No No No Yes Yes Yes Yes Levels: No Yes 24

Naïve Bayes – what is it? Example: testing for a specific item of knowledge that 1% of the population has been informed of (don’t ask how). An imperfect test: –99% of knowledgeable people test positive –99% of ignorant people test negative If a person tests positive – what is the probability that they know the fact? 25

Naïve approach… We have 10,000 representative people 100 know the fact/item, 9,900 do not We test them all: –Get 99 knowing people testing knowing –Get 99 not knowing people testing not knowing –But 99 not knowing people testing as knowing Testing positive (knowing) – equally likely to know or not = 50% 26

Tree diagram 10000 ppl 1% know (100ppl) 99% test to know (99ppl) 1% test not to know (1per) 99% do not know (9900ppl) 1% test to know (99ppl) 99% test not to know (9801ppl) 27

Relation between probabilities For outcomes x and y there are probabilities of p(x) and p (y) that either happened If there’s a connection then the joint probability - both happen = p(x,y) Or x happens given y happens = p(x|y) or vice versa then: –p(x|y)*p(y)=p(x,y)=p(y|x)*p(x) So p(y|x)=p(x|y)*p(y)/p(x) (Bayes’ Law) E.g. p(know|+ve)=p(+ve|know)*p(know)/p(+ve)= (.99*.01)/(.99*.01+.01*.99) = 0.5 28

How do you use it? If the population contains x what is the chance that y is true? p(SPAM|word)=p(word|SPAM)*p(SPAM)/p(w ord) Base this on data: –p(spam) counts proportion of spam versus not –p(word|spam) counts prevalence of spam containing the ‘word’ –p(word|!spam) counts prevalence of non-spam containing the ‘word’ 29

Or.. What is the probability that you are in one class (i) over another class (j) given another factor (X)? Invoke Bayes: Maximize p(X|Ci)p(Ci)/p(X) (p(X)~constant and p(Ci) are equal if not known) So: conditional indep - 30

P(x k | C i ) is estimated from the training samples – Categorical: Estimate P(x k | C i ) as percentage of samples of class i with value x k Training involves counting percentage of occurrence of each possible value for each class –Numeric: Actual form of density function is generally not known, so “normal” density is often assumed 31

Thus.. Supervised or training set needed We will explore this more on Friday 32

Tentative assignments Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ March 7. 15% (10% written and 5% oral; individual); Assignment 5: Term project proposal. Due ~ March 18. 5% (0% written and 5% oral; individual); Term project (6). Due ~ week 13. 30% (25% written, 5% oral; individual). Assignment 7: Predictive and Prescriptive Analytics. Due ~ week 9/10. 20% (15% written and 5% oral; individual); 33

Coming weeks I will be out of town Friday March 21 and 28 On March 21 you will have a lab – attendance will be taken – to work on assignments (term (6) and assignment 7). Normal lecture on March 18. On March 28 you will have a lecture on SVM, thus the Tuesday March 25 will be a lab. Back to regular schedule in April (except 18 th ) 34

Admin info (keep/ print this slide) Class: ITWS-4963/ITWS 6965 Hours: 12:00pm-1:50pm Tuesday/ Friday Location: SAGE 3101 Instructor: Peter Fox Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not leave a msg)pfox@cs.rpi.edu Contact hours: Monday** 3:00-4:00pm (or by email appt) Contact location: Winslow 2120 (sometimes Lally 207A announced by email) TA: Lakshmi Chenicheri chenil@rpi.educhenil@rpi.edu Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014http://tw.rpi.edu/web/courses/DataAnalytics/2014 –Schedule, lectures, syllabus, reading, assignments, etc. 35

Download ppt "1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 6a, February 25, 2014, SAGE 3101 kNN, K-Means, Clustering and Bayesian Inference."

Similar presentations