1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4b, February 14, 2014 Lab exercises: regression, kNN and K-means.

2 Today
Linear regression
K Nearest Neighbors
K Means

3 The Dataset(s)
Some new ones: nyt/, sales/ and fb100/ (.mat files).
Three script fragments (i.e. they will not run as-is, I think) to help with code for today: Lab4b_{1,2,3}.R

4 Linear and least-squares
> EPI_data <- read.csv("EPI_data.csv")
> attach(EPI_data)
> boxplot(ENVHEALTH, DALY, AIR_H, WATER_H)
> lmENVH <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H)
> lmENVH … (what should you get?)
> summary(lmENVH) …
> cENVH <- coef(lmENVH)
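A minimal sketch of what lm() computes, using synthetic data rather than EPI_data.csv. The names x1, x2, y and the simulated coefficients are assumptions for illustration only. The least-squares coefficients from lm() should agree with a direct solution of the normal equations.

```r
# Synthetic stand-in for the EPI columns (hypothetical data, not from the lab files)
set.seed(1)
n  <- 100
x1 <- runif(n, 0, 100)
x2 <- runif(n, 0, 100)
y  <- 10 + 0.5 * x1 + 0.3 * x2 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2)

# Least squares by hand: solve (X'X) b = X'y
X <- cbind(1, x1, x2)
b <- solve(t(X) %*% X, t(X) %*% y)

coef_lm   <- unname(coef(fit))
coef_hand <- as.vector(b)
print(cbind(coef_lm, coef_hand))
```

The two columns printed should match to many decimal places; lm() uses a QR decomposition internally, which is numerically safer than forming X'X.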

5 Predict
> DALYNEW <- c(seq(5,95,5))
> AIR_HNEW <- c(seq(5,95,5))
> WATER_HNEW <- c(seq(5,95,5))
> # column names must match the predictors used in lmENVH
> NEW <- data.frame(DALY=DALYNEW, AIR_H=AIR_HNEW, WATER_H=WATER_HNEW)
> pENV <- predict(lmENVH, NEW, interval="prediction")
> cENV <- predict(lmENVH, NEW, interval="confidence")
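A small sketch of the difference between the two interval types, on synthetic data (x, y and the simulated line are assumptions, not the EPI variables). The confidence interval covers the fitted mean; the prediction interval covers a new observation and is therefore wider.

```r
set.seed(2)
x <- seq(5, 95, 5)
y <- 20 + 0.6 * x + rnorm(length(x), sd = 5)
fit <- lm(y ~ x)

# newdata must use the predictor's own name ("x")
new <- data.frame(x = c(10, 50, 90))
ci <- predict(fit, new, interval = "confidence")
pr <- predict(fit, new, interval = "prediction")

# Width of each interval (upr - lwr)
ci_width <- ci[, "upr"] - ci[, "lwr"]
pr_width <- pr[, "upr"] - pr[, "lwr"]
print(cbind(ci_width, pr_width))
```

Both calls return the same "fit" column; only the lwr/upr bounds differ.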

6 Repeat for AIR_E and CLIMATE

7 Remember a few useful cmds: head( ), tail( ), summary( )

8 K Nearest Neighbors (classification)
> library(class) # provides knn()
> nyt1 <- read.csv("nyt1.csv")
> nyt1 <- nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),] # shrink it down!
> nnyt1 <- dim(nyt1)[1]
> sampling.rate=0.9
> num.test.set.labels=nnyt1*(1.-sampling.rate)
> training <- sample(1:nnyt1, sampling.rate*nnyt1, replace=FALSE)
> train <- subset(nyt1[training,], select=c(Age,Impressions))
> testing <- setdiff(1:nnyt1, training)
> test <- subset(nyt1[testing,], select=c(Age,Impressions))
> cg <- nyt1$Gender[training]
> true.labels <- nyt1$Gender[testing]
> classif <- knn(train, test, cg, k=5)
> classif
> attributes(.Last.value) # interpretation to come!
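The same train/test pattern can be tried without nyt1.csv. This sketch swaps in R's built-in iris data (an assumption for illustration; the columns and labels are not those of the lab) and adds a confusion table and misclassification rate for the held-out rows.

```r
library(class)   # provides knn()

set.seed(3)
n <- nrow(iris)
sampling.rate <- 0.9
training <- sample(1:n, floor(sampling.rate * n), replace = FALSE)
testing  <- setdiff(1:n, training)

train <- iris[training, c("Petal.Length", "Petal.Width")]
test  <- iris[testing,  c("Petal.Length", "Petal.Width")]
cg          <- iris$Species[training]   # class labels for the training rows
true.labels <- iris$Species[testing]    # held-out labels, used only for scoring

classif <- knn(train, test, cg, k = 5)

# Confusion table and misclassification rate on the held-out rows
print(table(predicted = classif, actual = true.labels))
misclass <- mean(classif != true.labels)
print(misclass)
```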

9 Regression
> bronx <- read.xlsx("sales/rollingsales_bronx.xls", pattern="BOROUGH", stringsAsFactors=FALSE, sheetIndex=1, startRow=5, header=TRUE)
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> m1 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET), data=bronx)
What's wrong?

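10 Clean up…
> # keep only rows with positive areas and prices, so that log() is finite
> bronx <- bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
> m1 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET), data=bronx)
> summary(m1)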
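A sketch of why the clean-up is needed, on a hypothetical six-row miniature of the sales data (the values are made up). log(0) is -Inf, and lm() stops when the response or a predictor contains infinite values.

```r
# Hypothetical miniature of the sales data: some rows have zero area or price
sales <- data.frame(
  GROSS.SQUARE.FEET = c(0, 1200, 2500, 0, 4000, 1800),
  SALE.PRICE        = c(350000, 0, 520000, 410000, 900000, 380000)
)

# log(0) is -Inf
has_inf <- any(is.infinite(log(sales$GROSS.SQUARE.FEET)))

# lm() on the raw rows stops with an error; try() captures it
raw_fit    <- try(lm(log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = sales),
                  silent = TRUE)
raw_failed <- inherits(raw_fit, "try-error")

# Keep only strictly positive rows, then fit
clean <- sales[which(sales$GROSS.SQUARE.FEET > 0 & sales$SALE.PRICE > 0), ]
fit   <- lm(log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = clean)
print(coef(fit))
```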

11 Call:
lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)
Residuals:
     Min       1Q   Median       3Q      Max
-14.4529   0.0377   0.4160   0.6572   3.8159
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)              7.0271     0.3088   22.75   <2e-16 ***
log(GROSS.SQUARE.FEET)   0.7013     0.0379   18.50   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.95 on 2435 degrees of freedom
Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229
F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16
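One way to read the slope of a log-log fit, sketched with the two coefficients printed in this summary: the model says price is proportional to area raised to the slope, so doubling the gross square feet multiplies the predicted price by 2^slope. The 2000 and 4000 sq ft inputs are arbitrary illustrations.

```r
intercept <- 7.0271   # (Intercept) from the summary
slope     <- 0.7013   # log(GROSS.SQUARE.FEET) coefficient from the summary

# Multiplicative change in predicted price when floor area doubles
doubling_factor <- 2^slope
print(doubling_factor)

# Back-transformed predictions for two hypothetical floor areas
pred_2000 <- exp(intercept + slope * log(2000))
pred_4000 <- exp(intercept + slope * log(4000))
print(pred_4000 / pred_2000)   # same ratio as doubling_factor
```

With an R-squared near 0.12, the fit explains little of the variation, so these predictions are rough at best.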

12 Plot
> plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
> abline(m1,col="red",lwd=2)
# then
> plot(resid(m1))

13 Another model (2)?
Add two more variables to the linear model: LAND.SQUARE.FEET and NEIGHBORHOOD.
Repeat but suppress the intercept (2a).


15 Solution model 2
> m2 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD), data=bronx)
> summary(m2)
> plot(resid(m2))
#
> m2a <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD), data=bronx)
> summary(m2a)
> plot(resid(m2a))
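A sketch of what suppressing the intercept does when a factor is in the model, on synthetic data (nbhd, x, y are hypothetical names). With an intercept, the first factor level is absorbed into it; with `0 +`, every level gets its own coefficient. The fitted values are the same either way.

```r
set.seed(4)
d <- data.frame(
  nbhd = factor(rep(c("A", "B", "C"), each = 10)),
  x    = runif(30, 1, 10)
)
level_effect <- c(A = 1, B = 2, C = 3)
d$y <- unname(level_effect[as.character(d$nbhd)]) + 0.5 * d$x + rnorm(30, sd = 0.1)

with_int <- lm(y ~ x + nbhd, data = d)       # baseline level folded into the intercept
no_int   <- lm(y ~ 0 + x + nbhd, data = d)   # one coefficient per level

names_with <- names(coef(with_int))  # "(Intercept)" "x" "nbhdB" "nbhdC"
names_no   <- names(coef(no_int))    # "x" "nbhdA" "nbhdB" "nbhdC"
print(names_with)
print(names_no)
```

Note that summary() computes R-squared differently when there is no intercept, so the R-squared values of m2 and m2a are not directly comparable.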

17 Solution model 3 and 4
> m3 <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)+factor(bronx$BUILDING.CLASS.CATEGORY), data=bronx)
> summary(m3)
> plot(resid(m3))
#
> m4 <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)*factor(bronx$BUILDING.CLASS.CATEGORY), data=bronx)
> summary(m4)
> plot(resid(m4))
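The only difference between models 3 and 4 is `+` versus `*` between the two factors. A quick sketch with placeholder names (y, a, b are hypothetical) shows how R expands each formula: `a * b` is shorthand for both main effects plus their interaction.

```r
f_add  <- y ~ 0 + a + b
f_star <- y ~ 0 + a * b

# terms() expands the formula without needing any data
labels_add  <- attr(terms(f_add),  "term.labels")
labels_star <- attr(terms(f_star), "term.labels")
print(labels_add)    # "a" "b"
print(labels_star)   # "a" "b" "a:b"
```

With many neighborhoods and building classes, the interaction adds a coefficient for each observed combination, so model 4 can have far more parameters than model 3.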

19 And now… a complex example
> install.packages("geoPlot")
> install.packages("xlsx")
> require(class)
> require(gdata)
> require(geoPlot)
> require("xlsx")
#if not already read-in:
bronx <- read.xlsx("sales/rollingsales_bronx.xls", pattern="BOROUGH", stringsAsFactors=FALSE, sheetIndex=1, startRow=5, header=TRUE)

20 View(bronx)
#clean up with regular expressions
bronx$SALE.PRICE<-as.numeric(gsub("[^[:digit:]]","",bronx$SALE.PRICE))
#missing values?
sum(is.na(bronx$SALE.PRICE))
#zero sale prices
sum(bronx$SALE.PRICE==0)
#clean these numeric and date fields
bronx$GROSS.SQUARE.FEET<-as.numeric(gsub("[^[:digit:]]","",bronx$GROSS.SQUARE.FEET))
bronx$LAND.SQUARE.FEET<-as.numeric(gsub("[^[:digit:]]","",bronx$LAND.SQUARE.FEET))
bronx$SALE.DATE<-as.Date(gsub("[^[:digit:]]","",bronx$SALE.DATE))
bronx$YEAR.BUILT<-as.numeric(gsub("[^[:digit:]]","",bronx$YEAR.BUILT))
bronx$ZIP.CODE<-as.character(gsub("[^[:digit:]]","",bronx$ZIP.CODE))
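A sketch of what the digit-only clean-up does to a few hypothetical price strings (the sample values are made up). The pattern `[^[:digit:]]` matches every character that is not 0-9, so currency signs, commas and spaces are removed; a field with no digits becomes the empty string and then NA.

```r
raw_price   <- c("$1,250,000", "450000", "$ 0", "-")
digits_only <- gsub("[^[:digit:]]", "", raw_price)
price       <- as.numeric(digits_only)   # "" becomes NA

n_missing <- sum(is.na(price))
n_zero    <- sum(price == 0, na.rm = TRUE)
print(digits_only)
print(price)
```

Note that the same pattern also strips minus signs and decimal points, so it only suits fields that are non-negative whole numbers.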

21 More corrections
#filter out low prices
minprice<-10000
bronx<-bronx[which(bronx$SALE.PRICE>=minprice),]
#how many left?
nval<-dim(bronx)[1]
#addresses contain apartment #'s even though there is another column for that - remove them - compresses addresses
#(trim() comes from gdata; runs of spaces are collapsed to a single space)
bronx$ADDRESSONLY<-gsub("[,][[:print:]]*","",gsub("[ ]+"," ",trim(bronx$ADDRESS)))
#new data frame for sorting the addresses, fixing etc.
bronxadd<-unique(data.frame(bronx$ADDRESSONLY, bronx$ZIP.CODE, stringsAsFactors=FALSE))
# fix the names
names(bronxadd)<-c("ADDRESSONLY","ZIP.CODE")
bronxadd<-bronxadd[order(bronxadd$ADDRESSONLY),]
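A sketch of the address clean-up on three hypothetical addresses. Base R's trimws() stands in for gdata's trim() here, and collapsing runs of spaces to a single space is an assumption about the intent. The inner gsub squeezes whitespace; the outer one drops everything from the first comma onward (the apartment number).

```r
address <- c("  1234   EAST 167 STREET, 4B",
             "55 GRAND CONCOURSE",
             "1234 EAST 167 STREET, 2A ")

squeezed     <- gsub("[ ]+", " ", trimws(address))       # collapse repeated spaces
address_only <- gsub("[,][[:print:]]*", "", squeezed)    # strip ", <apartment>"

unique_addresses <- unique(address_only)
print(unique_addresses)
```

After the clean-up, the two apartments in the same building collapse to a single street address.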

22 Yep, more…
# duplicates?
duplicates<-duplicated(bronxadd$ADDRESSONLY)
##if(duplicates) dupadd<-bronxadd[bronxadd$duplicates,1]
##bronxadd<-bronxadd[(bronxadd$ADDRESSONLY!=dupadd[1] & bronxadd$ADDRESSONLY != dupadd[2]),]
#how many?
nadd<-dim(bronxadd)[1]
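The commented-out lines hint at dropping addresses that occur more than once. One possible way to do that for any number of duplicates is sketched below on a hypothetical four-row table; this is an illustration, not the lab's own solution.

```r
addr <- data.frame(
  ADDRESSONLY = c("10 A ST", "10 A ST", "20 B AVE", "30 C PL"),
  ZIP.CODE    = c("10451", "10452", "10453", "10454"),
  stringsAsFactors = FALSE
)

# duplicated() is TRUE for the second and later occurrences of a value
dup_flag <- duplicated(addr$ADDRESSONLY)
dupadd   <- unique(addr$ADDRESSONLY[dup_flag])

# Drop every row whose address appears more than once
addr_clean <- addr[!(addr$ADDRESSONLY %in% dupadd), ]
print(addr_clean)
```

Here the same street address maps to two ZIP codes, so it is ambiguous for geocoding and both rows are removed.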

23 Oh, we want nearest neighbors? How?
#problem, we need a spatial distribution since none of the columns have that
#we will use google maps so limit the number to under 500 (ask me why)
nsample=450
#random sample of nsample rows (nadd is the row count of bronxadd)
addsample<-bronxadd[sample.int(nadd,size=nsample),]
#new data frame for the full address
addrlist<-data.frame(1:nsample,addsample$ADDRESSONLY,rep("NEW YORK",times=nsample),rep("NY",times=nsample),addsample$ZIP.CODE,rep("US",times=nsample))
#look them up
querylist<-addrListLookup(addrlist)

24 Lots missing – why?
# how many returned valid lat/long?
querylist$matched <- (querylist$latitude !=0)
unmatchedid<-which(!querylist$matched)
#MANY missing - what's up?
unmatched<-length(unmatchedid)
#WEST -> W and EAST -> E - do again.
addrlist2<-data.frame(1:unmatched,gsub(" WEST "," W ",gsub(" EAST "," E ",addsample[unmatchedid,1])),rep("NEW YORK",times=unmatched),rep("NY",times=unmatched),addsample[unmatchedid,2],rep("US",times=unmatched))
querylist[unmatchedid,1:4]<-addrListLookup(addrlist2)[,1:4]
querylist$matched <- (querylist$latitude !=0)
unmatchedid<-which(!querylist$matched)
unmatched<-length(unmatchedid)

25 Not enough
#this fixed a LOT but we need more: STREET and AVENUE (could have done PLACE) and others
addrlist3<-data.frame(1:unmatched,gsub("WEST","W",gsub("EAST","E",gsub("STREET","ST ",gsub("AVENUE","AVE",addsample[unmatchedid,1])))),rep("NEW YORK",times=unmatched),rep("NY",times=unmatched),addsample[unmatchedid,2],rep("US",times=unmatched))
querylist[unmatchedid,1:4]<-addrListLookup(addrlist3)[,1:4]
querylist$matched <- (querylist$latitude !=0)
unmatchedid<-which(!querylist$matched)
unmatched<-length(unmatchedid)
# 9 left now? good enough.
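The nested gsub calls replace the word anywhere it appears, so an un-anchored "WEST" pattern would also turn WESTCHESTER into WCHESTER. One possible refinement, sketched here with a hypothetical helper `shorten()` and a made-up abbreviation table, matches whole words only by using `\\b` word boundaries.

```r
# Hypothetical abbreviation table: full word -> short form
abbrev <- c(WEST = "W", EAST = "E", STREET = "ST", AVENUE = "AVE")

# Replace each full word with its abbreviation, whole words only
shorten <- function(x) {
  for (word in names(abbrev)) {
    x <- gsub(paste0("\\b", word, "\\b"), abbrev[[word]], x, perl = TRUE)
  }
  x
}

out <- shorten(c("1234 EAST 167 STREET", "900 WESTCHESTER AVENUE"))
print(out)
```

Keeping the abbreviations in one named vector also makes it easy to add PLACE, BOULEVARD and similar cases later.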

26 Rebuild!
addsample<-cbind(addsample,querylist$latitude,querylist$longitude)
##names(addsample[3:4])<-c("latitude","longitude") - this was meant to correct the column names but did not work for me
addsample<-addsample[addsample$'querylist$latitude'!=0,] # note ' ' to work around column name
adduse<-merge(bronx,addsample)
#drop rows without a latitude (same quoted column name as above)
adduse<-adduse[!is.na(adduse$'querylist$latitude'),]
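On the column-name problem: a sketch, with a hypothetical toy data frame, of renaming columns by position. Putting the index outside the names() call, as in `names(df)[3:4] <- ...`, assigns into the data frame's own name vector; the form with the index inside the call operates on a temporary subset, which is a likely reason the commented-out line had no effect.

```r
# Hypothetical toy frame with two placeholder coordinate columns
df <- data.frame(
  ADDRESSONLY = c("10 A ST", "20 B AVE"),
  ZIP.CODE    = c("10451", "10453"),
  V3          = c(40.82, 40.85),
  V4          = c(-73.92, -73.90),
  stringsAsFactors = FALSE
)

# Index the result of names(), not the data frame inside names()
names(df)[3:4] <- c("latitude", "longitude")
print(names(df))

# The renamed columns can now be used without quoting
has_coord <- !is.na(df$latitude) & df$latitude != 0
print(df[has_coord, ])
```

If the columns were renamed this way in the lab code, later references to the quoted names would need to change to match.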

27 Most satisfying part!
mapcoord<-adduse[,c(2,4,24,25)]
table(mapcoord$NEIGHBORHOOD)
mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD)
#
geoPlot(mapcoord,zoom=12,color=mapcoord$NEIGHBORHOOD)

29 Did you forget the KNN?
#almost there.
mapcoord$class<-as.numeric(mapcoord$NEIGHBORHOOD)
nclass<-dim(mapcoord)[1]
split<-0.8
#random 80% of the rows for training, the rest for testing
trainid<-sample.int(nclass,floor(split*nclass))
testid<-(1:nclass)[-trainid]
##mappred<-mapcoord[testid,]
##mappred$class<-as.numeric(mappred$NEIGHBORHOOD)

30 KNN!
kmax<-10
knnpred<-matrix(NA,ncol=kmax,nrow=length(testid))
knntesterr<-rep(NA,times=kmax)
for (i in 1:kmax){ # loop over k
  # store labels as character so they compare cleanly with the true neighborhoods
  knnpred[,i]<-as.character(knn(mapcoord[trainid,3:4],mapcoord[testid,3:4],cl=mapcoord[trainid,2],k=i))
  knntesterr[i]<-sum(knnpred[,i]!=as.character(mapcoord[testid,2]))/length(testid)
}
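The loop over k can be tried end to end without the geocoded data. This sketch uses the built-in iris data as a stand-in (an assumption for illustration), with the same 80/20 split and test-error vector, and then picks the k with the lowest held-out error.

```r
library(class)   # provides knn()

set.seed(5)
n <- nrow(iris)
split   <- 0.8
trainid <- sample.int(n, floor(split * n))
testid  <- (1:n)[-trainid]

kmax <- 10
knntesterr <- rep(NA, times = kmax)
for (i in 1:kmax) {   # loop over k
  pred <- knn(iris[trainid, 1:4], iris[testid, 1:4],
              cl = iris$Species[trainid], k = i)
  # fraction of held-out rows whose predicted label differs from the truth
  knntesterr[i] <- mean(as.character(pred) != as.character(iris$Species[testid]))
}

best_k <- which.min(knntesterr)
print(knntesterr)
print(best_k)
```

With a single random split the error curve is noisy; repeating the split or cross-validating would give a steadier choice of k.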

31 Finally K-Means!
> # kmeans() needs numeric columns, so ZIP.CODE is converted from character
> mapmeans<-data.frame(as.numeric(adduse$ZIP.CODE), as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$LAND.SQUARE.FEET, adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')
> mapobj<-kmeans(mapmeans,5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
> fitted(mapobj,method=c("centers","classes"))
> plot(mapmeans,col=mapobj$cluster) # color the scatterplot matrix by cluster
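A minimal kmeans() sketch on synthetic data with two well-separated groups (the price and lat values are made up). It also applies scale() before clustering, which is an addition not in the slide: with columns in very different units, such as dollars and degrees, unscaled distances are dominated by the largest-valued column.

```r
set.seed(6)
# Two hypothetical groups that differ in both price and latitude
price <- c(rnorm(25, 2e5, 1e4),     rnorm(25, 6e5, 1e4))
lat   <- c(rnorm(25, 40.82, 0.005), rnorm(25, 40.88, 0.005))
d     <- data.frame(price, lat)
truth <- rep(1:2, each = 25)

# Standardize columns so each contributes comparably to the distance
fit <- kmeans(scale(d), centers = 2, nstart = 5)

# Cluster labels are arbitrary, so score agreement up to relabeling
agreement <- max(mean(fit$cluster == truth), mean(fit$cluster != truth))
print(table(truth, cluster = fit$cluster))
print(fit$centers)   # centers are in scaled units
```

For the Bronx data, the number of clusters (5 on the slide) is a choice worth varying and comparing via fit$tot.withinss.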

33 Assignment 3
Preliminary and Statistical Analysis. Due ~ Feb 28. 15% (written).
Distribution analysis and comparison, visual ‘analysis’, statistical model fitting and testing of some of the nyt1…31 datasets.

34 Tentative assignments
Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ early March. 15% (10% written and 5% oral; individual);
Assignment 5: Term project proposal. Due ~ week 7. 5% (0% written and 5% oral; individual);
Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual);
Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).

35 Admin info (keep/print this slide)
Class: ITWS-4963/ITWS 6965
Hours: 12:00pm-1:50pm Tuesday/Friday
Location: SAGE 3101
Instructor: Peter Fox
Instructor contact: 518.276.4862 (do not leave a msg)
Contact hours: Monday** 3:00-4:00pm (or by email appt)
Contact location: Winslow 2120 (sometimes Lally 207A announced by email)
TA: Lakshmi Chenicheri
Web site: schedule, lectures, syllabus, reading, assignments, etc.

36 Table: Matlab/R/scipy-numpy

