# Lecture Data Mining in R 732A44 Programming in R.

Lecture Data Mining in R 732A44 Programming in R

Logistic regression: two classes

Consider a logistic model with one predictor: X = price of the car, Y = equipment.

Use the function glm(formula, family, data):
– Formula: Response ~ Model. The model consists of terms such as a+b (addition), a:b (interaction) and a*b (addition and interaction); "." stands for all predictors
– Family: specify binomial

Logistic regression: two classes

    reg <- glm(X3...Equipment ~ Price.in.SEK., family = binomial, data = mydata)
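The car data set of the slides (`mydata`) is not available here, so the following is only a sketch of the same call on R's built-in `mtcars` data. Using the transmission type `am` as the binary response and the weight `wt` as the single predictor is an assumption made for illustration, as are the three weights in `new_cars`.

```r
# Sketch: logistic regression with one predictor.
# Assumption: mtcars stands in for the car data of the slides.
# Model: P(am = 1 | wt) = 1 / (1 + exp(-(b0 + b1 * wt)))
fit <- glm(am ~ wt, family = binomial, data = mtcars)

summary(fit)$coefficients   # estimates on the log-odds scale

# Predicted probabilities for three hypothetical car weights
new_cars <- data.frame(wt = c(2, 3, 4))
p_hat <- predict(fit, newdata = new_cars, type = "response")
p_hat
```

With `type = "response"` the predictions are probabilities; the default `type = "link"` would return log-odds instead.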

Logistic regression: several predictors

Data about contraceptive use
– Several analysis plots can be obtained with plot(lrfit)
– Response: a matrix of success/failure counts

Logistic regression: further comments

– Nominal logistic regression: library mlogit, function mlogit()
– Stepwise model selection: step() function
– Prediction: predict() function

Smoothing splines

Minimize a penalized sum of squared residuals, where λ is the smoothing parameter.
– λ = 0: any function interpolating the data
– λ = +∞: least squares line fit
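The criterion itself did not survive in this transcript. The standard smoothing-spline criterion, which is consistent with both limiting cases listed above, is given here as an assumption about what the slide displayed:

$$
\mathrm{RSS}(f,\lambda)=\sum_{i=1}^{n}\bigl(y_i-f(x_i)\bigr)^2+\lambda\int\bigl(f''(t)\bigr)^2\,dt
$$

With λ = 0 the penalty vanishes, so any interpolating f gives a zero criterion. As λ grows, curvature is penalized ever more heavily, and in the limit f'' must be zero everywhere, which leaves the least squares line.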

Smoothing splines

smooth.spline(x, y, df, spar, cv, …)
– df: degrees of freedom
– spar: penalty parameter
– cv: TRUE = ordinary leave-one-out CV, FALSE = GCV, NA = no CV

    plot(m2$Kilometer, m2$Price, main = "df=40")
    res <- smooth.spline(m2$Kilometer, m2$Price, df = 40)
    lines(res, col = "blue")
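To see how `df` controls the smoothness, the call can be tried on the built-in `cars` data. This is a sketch only: `cars` replaces the `m2` data of the slides, and the `df` values 3 and 10 are arbitrary choices.

```r
# Sketch: effect of df in smooth.spline().
# Assumption: the built-in cars data replaces m2 from the slides.
x <- cars$speed
y <- cars$dist

fit_smooth <- smooth.spline(x, y, df = 3)    # few df: nearly a straight line
fit_wiggly <- smooth.spline(x, y, df = 10)   # more df: follows the data closely
fit_auto   <- smooth.spline(x, y)            # smoothing chosen automatically

plot(x, y, main = "Smoothing splines on cars")
lines(fit_smooth, col = "blue")
lines(fit_wiggly, col = "red")

# Effective degrees of freedom actually used by each fit
c(smooth = fit_smooth$df, wiggly = fit_wiggly$df, auto = fit_auto$df)
```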

Generalized additive models

A function of the expected response is additive in the set of inputs.

Example: nonlinear logistic regression of a binary response
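The formulas of this slide are missing from the transcript. In the usual notation (an assumption about what was shown), with link function g and smooth functions f_j, the additive model reads:

$$
g\bigl(E[Y\mid X_1,\dots,X_p]\bigr)=\alpha+f_1(X_1)+\dots+f_p(X_p)
$$

For a binary response with the logit link this becomes the nonlinear logistic regression mentioned above:

$$
\log\frac{P(Y=1\mid X)}{1-P(Y=1\mid X)}=\alpha+f_1(X_1)+\dots+f_p(X_p)
$$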

GAM

Library: mgcv

gam(formula, family = gaussian, data, method = "GCV.Cp", select = FALSE, sp)
– method: method for selection of smoothing parameters
– select: if TRUE, variable selection is performed
– sp: smoothing parameters (the maximal df of a term is set by k in s(…), not by sp)
– formula: usual terms and spline terms s(…)

predict.gam() can be used for predictions.

Example: car properties

    bp <- gam(MPG ~ s(WT, sp = 2) + s(SP, sp = 1), data = m3)
    vis.gam(bp, theta = 10, phi = 30)
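A comparable fit can be sketched on built-in data. The assumptions here are that `mtcars` replaces the `m3` car data, with `mpg`, `wt` and `hp` in the roles of MPG, WT and SP, and that the smoothing parameters are estimated by GCV rather than fixed through `sp`.

```r
# Sketch: additive model with two smooth terms (mgcv).
# Assumption: mtcars replaces m3; mpg, wt, hp stand in for MPG, WT, SP.
library(mgcv)

fit_gam <- gam(mpg ~ s(wt) + s(hp), data = mtcars, method = "GCV.Cp")

summary(fit_gam)$s.table    # effective df and tests for each smooth term

# Predictions for two hypothetical cars
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
pred <- predict(fit_gam, newdata = new_cars)
pred
```

Calling `predict()` on a `gam` object dispatches to `predict.gam()`.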

GAM: smoothing components

    plot(bp, pages = 1)

Principal components analysis

Idea: introduce a new coordinate system (PC1, PC2, …) where
– the first principal component (PC1) is the direction that maximizes the variance of the projected data
– the second principal component (PC2) is the direction that maximizes the variance of the projected data after the variation along PC1 has been removed
– …

In the new coordinate system, the coefficients corresponding to the last principal components are very small, so these columns can be dropped.

[Figure: data cloud with the PC1 and PC2 axes]

Principal components analysis

princomp(x, ...)

    m4 <- m3
    m4$MODEL <- NULL      # drop the MODEL column
    res <- princomp(m4)
    loadings(res)
    plot(res)
    biplot(res)
    summary(res)
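The idea of dropping the last components can be sketched on built-in data. `USArrests` is an assumption standing in for `m4`, and `cor = TRUE` is added so that variables on different scales are standardized first.

```r
# Sketch: PCA and dimension reduction with princomp().
# Assumption: the built-in USArrests data replaces m4 from the slides.
res <- princomp(USArrests, cor = TRUE)   # cor = TRUE: use standardized variables

summary(res)                             # variance explained per component

# Share of total variance carried by each component, and cumulatively
var_share <- res$sdev^2 / sum(res$sdev^2)
round(cumsum(var_share), 3)

# Keep only the coordinates along the first two principal components
scores2 <- res$scores[, 1:2]
head(scores2)
```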

Decision trees

[Figure: rectangular partition of the (X1, X2) plane and the corresponding decision tree, with splits at 9, 16, 7 and 15]

Regression tree example

Training-validation-test

Training-validation split (60/40):

    sub <- sample(nrow(m2), floor(nrow(m2) * 0.6))
    training <- m2[sub, ]
    validation <- m2[-sub, ]

If a training-validation-test split is required, use a similar strategy.
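One way to extend the strategy to three parts is to permute the row indices once and cut the permutation into consecutive blocks. The 50/25/25 proportions, the seed and the use of the built-in `iris` data are all assumptions made for this sketch.

```r
# Sketch: training-validation-test split.
# Assumptions: 50/25/25 proportions; iris replaces m2 from the slides.
set.seed(12345)
n   <- nrow(iris)
idx <- sample(n)                  # random permutation of the row indices

n_train <- floor(0.50 * n)
n_valid <- floor(0.25 * n)

training   <- iris[idx[1:n_train], ]
validation <- iris[idx[(n_train + 1):(n_train + n_valid)], ]
test       <- iris[idx[(n_train + n_valid + 1):n], ]

c(nrow(training), nrow(validation), nrow(test))
```

Because every index appears exactly once in `idx`, the three parts are disjoint and together cover all rows.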

Decision trees by CART

Growing a full tree. Library: tree

Create a tree: tree(formula, data, subset, split = c("deviance", "gini"), …)
– subset: if a subset of the cases is to be used for training
– split: splitting criterion
– more parameters via the control argument

Prune the tree with the help of a validation set: prune.tree(tree, newdata, method = c("deviance", "misclass"), …)

Prune the tree with cross-validation: cv.tree(object, FUN = prune.tree, K = 10, ...)
– K: number of folds in the cross-validation

Classification trees: CART

Example: olive oils in Italy

    sub <- sample(nrow(m5), floor(nrow(m5) * 0.6))
    training <- m5[sub, ]
    validation <- m5[-sub, ]
    mytree <- tree(Area ~ . - Region - X, data = training)
    summary(mytree)
    plot(mytree, type = "uniform")
    text(mytree, cex = 0.5)

Classification trees: CART

Dependence of the misclassification rate on the size of the tree:

    treeseq1 <- prune.tree(mytree, newdata = validation, method = "misclass")
    plot(treeseq1); title("Validation")
    treeseq2 <- cv.tree(mytree, method = "misclass")
    plot(treeseq2); title("CV")

Regression trees: CART

    mytree2 <- tree(eicosenoic ~ linoleic + linolenic + palmitic + palmitoleic, data = training)
    mytree3 <- prune.tree(mytree2, best = 4)   # 4 leaves in total
    print(mytree3)
    summary(mytree3)
    plot(mytree3)
    text(mytree3)

Decision trees: other techniques

– Conditional inference trees. Library: party
– CART, another library: rpart

    training$X <- NULL
    training$Area <- NULL
    mytree4 <- ctree(Region ~ ., data = training)
    print(mytree4)
    plot(mytree4, type = "simple")   # gives nice plots
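The slides only name rpart, so the following is a hedged sketch of how CART with pruning might look in that package. The built-in `iris` data replaces the olive oil data, and choosing the complexity parameter with the smallest cross-validated error is one common pruning rule, not necessarily the one intended by the lecture.

```r
# Sketch: CART with rpart, pruned by cross-validated error.
# Assumption: iris replaces the olive oil data; 60/40 split as in the slides.
library(rpart)

set.seed(12345)
sub        <- sample(nrow(iris), floor(nrow(iris) * 0.6))
training   <- iris[sub, ]
validation <- iris[-sub, ]

fit <- rpart(Species ~ ., data = training, method = "class")
printcp(fit)                      # cross-validated error for each tree size

# Prune at the complexity value with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Misclassification rate on the validation set
pred <- predict(pruned, newdata = validation, type = "class")
rate <- mean(pred != validation$Species)
rate
```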

Neural network

– Input nodes, input layer
– [Hidden nodes, hidden layer(s)]
– Output nodes, output layer
– Weights
– Activation functions
– Combination functions

[Figure: feed-forward network with inputs x1, …, xp, hidden nodes z1, …, zM and outputs f1, …, fK]

Neural networks

Feed-forward NNs. Library: neuralnet

neuralnet(formula, data, hidden = 1, rep = 1, startweights = NULL, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = TRUE, …)
– hidden: vector giving the number of hidden neurons in each layer
– rep: number of training runs of the network
– startweights: starting weights
– algorithm: "backprop", "rprop+", "sag", "slr"
– err.fct: "sse", "ce" (cross-entropy) or any user-supplied function
– act.fct: "logistic", "tanh" or any user-supplied function
– linear.output: TRUE if no activation is applied at the output

Related functions:
– confidence.interval(x, alpha = 0.05): confidence intervals for the weights
– compute(x, covariate): prediction
– plot(x, …): plot the given neural network

Neural networks: example

    mynet <- neuralnet(Region ~ eicosenoic + linoleic + linolenic + palmitic,
                       data = training, rep = 5, hidden = c(2, 2), act.fct = "tanh")
    plot(mynet)
    mynet$result.matrix

Neural networks

– Prediction with compute()
– Finding the misclassification rate: table(true_values, predicted_values), not only for neural networks

Another package, ready for a qualitative response (classical nnet):

    mynet1 <- nnet(Region ~ eicosenoic + linoleic, data = training, size = 3)
    coef(mynet1)
    predict(mynet1, newdata = validation)
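The misclassification rate from a confusion table can be sketched with nnet on built-in data. The `iris` data, the two petal predictors, the seed and `size = 3` are assumptions for illustration; the fitted weights depend on the random starting values.

```r
# Sketch: confusion table and misclassification rate for an nnet classifier.
# Assumption: iris replaces the olive oil data of the slides.
library(nnet)

set.seed(12345)
sub        <- sample(nrow(iris), floor(nrow(iris) * 0.6))
training   <- iris[sub, ]
validation <- iris[-sub, ]

net <- nnet(Species ~ Petal.Length + Petal.Width, data = training,
            size = 3, trace = FALSE)

# type = "class" returns the predicted class labels as character strings
pred <- predict(net, newdata = validation, type = "class")

conf <- table(true = as.character(validation$Species), predicted = pred)
conf

rate <- mean(pred != as.character(validation$Species))
rate
```

The same `table()` and `mean()` steps work for any classifier that returns predicted labels.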

Clustering

The purpose is to identify (separated) groups of observations in the input space.
– K-means
– Hierarchical
– Density-based

K-means

– The number of seeds K must be given
– Starting seed positions are needed

kmeans(x, centers, iter.max = 10, nstart = 1)
– x: data frame
– centers: either the value K or a set of initial cluster centers
– iter.max: maximum number of iterations

    res <- kmeans(data.frame(m5$linoleic, m5$eicosenoic), 2)

K-means

One way to visualize:

    plot(m5$linoleic, m5$eicosenoic, col = res$cluster)
    points(res$centers[, 1], res$centers[, 2], col = 1:2, pch = 8, cex = 2)
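Since the result depends on the starting seeds, a runnable sketch usually fixes the random seed and tries several starts. Here `iris` replaces the olive oil data `m5`, and K = 3 and `nstart = 20` are assumed values.

```r
# Sketch: K-means with several random starts.
# Assumption: iris replaces m5; K = 3 and nstart = 20 are arbitrary choices.
set.seed(12345)
x <- iris[, c("Petal.Length", "Petal.Width")]

res <- kmeans(x, centers = 3, nstart = 20)   # best of 20 random starts is kept

res$size            # number of observations in each cluster
res$tot.withinss    # total within-cluster sum of squares

plot(x, col = res$cluster)
points(res$centers, col = 1:3, pch = 8, cex = 2)   # cluster centers
```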

Hierarchical clustering

Agglomerative:
– Place each point into its own cluster
– Merge the nearest clusters until only one cluster remains

What does "two objects are close" mean?
– Measure of proximity (e.g. for quantitative variables, Euclidean distance)
– Similarity measure s_rs (= 1 if same object, < 1 otherwise). Example: correlation
– Dissimilarity measure δ_rs (= 0 if same object, > 0 otherwise). Example: Euclidean distance

Hierarchical clustering

hclust(d, method = "complete", members = NULL)
– d: dissimilarity measure
– method: "ward" ("ward.D" in R 3.1.0 and later), "single", "complete", "average", "mcquitty", "median" or "centroid"
Returned: a tree showing the merging sequence

cutree(tree, k = NULL, h = NULL)
– k: number of clusters to make
– h: height at which to cut the tree
Returned: cluster index

Hierarchical clustering: example

    x <- data.frame(m5$linolenic, m5$eicosenoic)
    m5_dist <- dist(x)
    m5_dend <- hclust(m5_dist, method = "complete")
    plot(m5_dend)

Hierarchical clustering: example

DO NOT forget to standardize!

    clust <- cutree(m5_dend, k = 2)
    plot(m5$linoleic, m5$eicosenoic, col = clust)
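The standardization step can be sketched with `scale()`, which centers each column and divides it by its standard deviation before the distances are computed. As before, `iris` is an assumed stand-in for the olive oil data.

```r
# Sketch: hierarchical clustering on standardized variables.
# Assumption: iris replaces m5 from the slides.
x <- scale(iris[, c("Petal.Length", "Petal.Width")])   # mean 0, sd 1 per column

d    <- dist(x)                          # Euclidean distances
dend <- hclust(d, method = "complete")
plot(dend, labels = FALSE)

clust <- cutree(dend, k = 2)             # cluster index for each observation
table(clust)

plot(iris$Petal.Length, iris$Petal.Width, col = clust)
```

Without `scale()`, a variable measured in larger units would dominate the Euclidean distance.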

Density-based clustering

Kernel-based density estimation. Library: pdfCluster

pdfCluster(x, h = h.norm(x), hmult = 0.75, …)
– x: data to be partitioned
– h: a vector of smoothing parameters
– hmult: shrinkage factor

    x <- data.frame(m5$linolenic, m5$eicosenoic)
    res <- pdfCluster(x)
    plot(res)

Reference

http://cran.r-project.org/doc/contrib/YanchangZhao-refcard-data-mining.pdf
