1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 12a, April 19, 2016 Cross-validation, Revisiting Regression – local models, and non-parametric…

coleman > head(coleman) salaryP fatherWc sstatus teacherSc motherLev Y 1 3.83 28.87 7.20 26.6 6.19 37.01 2 2.89 20.10 -11.71 24.4 5.17 26.51 3 2.86 69.05 12.32 25.7 7.04 36.51 4 2.92 65.40 14.28 25.7 7.10 40.70 5 3.06 29.59 6.31 25.4 6.15 37.10 6 2.07 44.82 6.16 21.6 6.41 33.90 2

Cross-validation package cvTools > call <- call("lmrob", formula = Y ~.) > # set up folds for cross-validation > folds <- cvFolds(nrow(coleman), K = 5, R = 10) > # perform cross-validation > cvTool(call, data = coleman, y = coleman$Y, cost = rtmspe, + folds = folds, costArgs = list(trim = 0.1)) CV [1,] 0.9880672 [2,] 0.9525881 [3,] 0.8989264 [4,] 1.0177694 [5,] 0.9860661 [6,] 1.8369717 [7,] 0.9550428 [8,] 1.0698466 [9,] 1.3568537 [10,] 0.8313474 3 Warning messages: 1: In lmrob.S(x, y, control = control) : S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps 2: In lmrob.S(x, y, control = control) : S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps 3: In lmrob.S(x, y, control = control) : find_scale() did not converge in 'maxit.scale' (= 200) iterations 4: In lmrob.S(x, y, control = control) : find_scale() did not converge in 'maxit.scale' (= 200) iterations

Evaluating? > cvFits 5-fold CV results: Fit CV 1 LS 1.674485 2 MM 1.147130 3 LTS 1.291797 Best model: CV "MM" 4

LS, LTS, MM? The breakdown value of an estimator is defined as the smallest fraction of contamination that can cause the estimator to take on values arbitrarily far from its value on the uncontaminated data. The breakdown value of an estimator can be used as a measure of the robustness of the estimator. Rousseeuw and Leroy (1987) and others introduced high breakdown value estimators for linear regression. LTS – see http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/ viewer.htm#statug_rreg_sect018.htm#statug.rreg.robustregfltsest http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/ viewer.htm#statug_rreg_sect018.htm#statug.rreg.robustregfltsest MM - http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/ viewer.htm#statug_rreg_sect019.htm http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/ viewer.htm#statug_rreg_sect019.htm 5

50 and 75% subsets fitLts50 <- ltsReg(Y ~., data = coleman, alpha = 0.5) cvFitLts50 <- cvLts(fitLts50, cost = rtmspe, folds = folds, fit = "both", trim = 0.1) # 75% subsets fitLts75 <- ltsReg(Y ~., data = coleman, alpha = 0.75) cvFitLts75 <- cvLts(fitLts75, cost = rtmspe, folds = folds, fit = "both", trim = 0.1) # combine and plot results cvFitsLts <- cvSelect("0.5" = cvFitLts50, "0.75" = cvFitLts75) 6

cvFitsLts (50/75) > cvFitsLts 5-fold CV results: Fit reweighted raw 1 0.5 1.291797 1.640922 2 0.75 1.065495 1.232691 Best model: reweighted raw "0.75" "0.75" 7

Tuning tuning <- list(tuning.psi=c(3.14, 3.44, 3.88, 4.68)) # perform cross-validation cvFitsLmrob <- cvTuning(fitLmrob$call, data = coleman, y = coleman$Y, tuning = tuning, cost = rtmspe, folds = folds, costArgs = list(trim = 0.1)) 8

cvFitsLmrob 5-fold CV results: tuning.psi CV 1 3.14 1.179620 2 3.44 1.156674 3 3.88 1.169436 4 4.68 1.133975 Optimal tuning parameter: tuning.psi CV 4.68 9

Lab on Friday mammals.glm <- glm(log(brain) ~ log(body), data = mammals) (cv.err <- cv.glm(mammals, mammals.glm)$delta) [1] 0.4918650 0.4916571 > (cv.err.6 <- cv.glm(mammals, mammals.glm, K = 6)$delta) [1] 0.4967271 0.4938003 # As this is a linear model we could calculate the leave-one-out # cross-validation estimate without any extra model-fitting. muhat <- fitted(mammals.glm) mammals.diag <- glm.diag(mammals.glm) (cv.err <- mean((mammals.glm$y - muhat)^2/(1 - mammals.diag$h)^2)) [1] 0.491865 10

Cost functions, etc. # leave-one-out and 11-fold cross-validation prediction error for # the nodal data set. Since the response is a binary variable # an appropriate cost function is > cost 0.5) > nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal) > (cv.err <- cv.glm(nodal, nodal.glm, cost, K = nrow(nodal))$delta) [1] 0.1886792 0.1886792 > (cv.11.err <- cv.glm(nodal, nodal.glm, cost, K = 11)$delta) [1] 0.2264151 0.2228551 11

cvTools http://cran.r- project.org/web/packages/cvTools/cvTools.pd fhttp://cran.r- project.org/web/packages/cvTools/cvTools.pd f Very powerful and flexible package for CV (regression) but very much a black box! If you use it, become very, very familiar with the outputs and be prepared to experiment… 12

Diamonds require(ggplot2) # or load package first data(diamonds) head(diamonds) # look at the data! # ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar() ggplot(diamonds, aes(clarity)) + geom_bar() + facet_wrap(~ cut) ggplot(diamonds) + geom_histogram(aes(x=price)) + geom_vline(xintercept=12000) ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut)) 13

14 ggplot(diamonds, aes(clarity)) + geom_freqpoly(aes(group = cut, colour = cut))

bodyfat ## regular linear model using three variables lm1 <- lm(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat) ## Estimate same model by glmboost glm1 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat) # We consider all available variables as potential predictors. glm2 <- glmboost(DEXfat ~., data = bodyfat) # or one could essentially call: preds <- names(bodyfat[, names(bodyfat) != "DEXfat"]) ## names of predictors fm <- as.formula(paste("DEXfat ~", paste(preds, collapse = "+"))) ## build formula 15

Compare linear models > coef(lm1) (Intercept) hipcirc kneebreadth anthro3a -75.2347840 0.5115264 1.9019904 8.9096375 > coef(glm1, off2int=TRUE) ## off2int adds the offset to the intercept (Intercept) hipcirc kneebreadth anthro3a -75.2073365 0.5114861 1.9005386 8.9071301 Conclusion? 16

> fm DEXfat ~ age + waistcirc + hipcirc + elbowbreadth + kneebreadth + anthro3a + anthro3b + anthro3c + anthro4 > coef(glm2, which = "") ## select all. (Intercept) age waistcirc hipcirc elbowbreadth kneebreadth anthro3a anthro3b anthro3c -98.8166077 0.0136017 0.1897156 0.3516258 - 0.3841399 1.7365888 3.3268603 3.6565240 0.5953626 anthro4 0.0000000 attr(,"offset") [1] 30.78282 17

> gam2 <- gamboost(DEXfat ~., baselearner = "bbs", data = bodyfat,control = boost_control(trace = TRUE)) [ 1].................................................. -- risk: 515.5713 [ 53].............................................. Final risk: 460.343 > set.seed(123) ## set seed to make results reproducible > cvm <- cvrisk(gam2) ## default method is 25-fold bootstrap cross-validation 18

> cvm Cross-validated Squared Error (Regression) gamboost(formula = DEXfat ~., data = bodyfat, baselearner = "bbs", control = boost_control(trace = TRUE)) 1 2 3 4 5 6 7 8 9 10 109.44043 93.90510 80.59096 69.60200 60.13397 52.59479 46.11235 40.80175 36.32637 32.66942 11 12 13 14 15 16 17 18 19 20 29.66258 27.07809 24.99304 23.11263 21.55970 20.40313 19.16541 18.31613 17.59806 16.96801 21 22 23 24 25 26 27 28 29 30 16.48827 16.07595 15.75689 15.47100 15.21898 15.06787 14.96986 14.86724 14.80542 14.74726 31 32 33 34 35 36 37 38 39 40 14.68165 14.68648 14.64315 14.67862 14.68193 14.68394 14.75454 14.80268 14.81760 14.87570 41 42 43 44 45 46 47 48 49 50 14.90511 14.92398 15.00389 15.03604 15.07639 15.10671 15.15364 15.20770 15.23825 15.30189 51 52 53 54 55 56 57 58 59 60 15.31950 15.35630 15.41134 15.46079 15.49545 15.53137 15.57602 15.61894 15.66218 15.71172 61 62 63 64 65 66 67 68 69 70 15.72119 15.75424 15.80828 15.84097 15.89077 15.90547 15.93003 15.95715 15.99073 16.03679 71 72 73 74 75 76 77 78 79 80 16.06174 16.10615 16.12734 16.15830 16.18715 16.22298 16.27167 16.27686 16.30944 16.33804 81 82 83 84 85 86 87 88 89 90 16.36836 16.39441 16.41587 16.43615 16.44862 16.48259 16.51989 16.52985 16.54723 16.58531 91 92 93 94 95 96 97 98 99 100 16.61028 16.61020 16.62380 16.64316 16.64343 16.68386 16.69995 16.73360 16.74944 16.75756 Optimal number of boosting iterations: 33 19

> mstop(cvm) ## extract the optimal mstop [1] 33 > gam2[ mstop(cvm) ] ## set the model automatically to the optimal mstop Model-based Boosting Call: gamboost(formula = DEXfat ~., data = bodyfat, baselearner = "bbs", control = boost_control(trace = TRUE)) Squared Error (Regression) Loss function: (y - f)^2 Number of boosting iterations: mstop = 33 Step size: 0.1 Offset: 30.78282 Number of baselearners: 9 20

plot(cvm) 21

> names(coef(gam2)) ## displays the selected base-learners at iteration 30 [1] "bbs(waistcirc, df = dfbase)" "bbs(hipcirc, df = dfbase)" "bbs(kneebreadth, df = dfbase)" [4] "bbs(anthro3a, df = dfbase)" "bbs(anthro3b, df = dfbase)" "bbs(anthro3c, df = dfbase)" [7] "bbs(anthro4, df = dfbase)" > gam2[1000, return = FALSE] # return = FALSE just supresses "print(gam2)" [ 101].................................................. -- risk: 423.9261 [ 153].................................................. -- risk: 397.4189 [ 205].................................................. -- risk: 377.0872 [ 257].................................................. -- risk: 360.7946 [ 309].................................................. -- risk: 347.4504 [ 361].................................................. -- risk: 336.1172 [ 413].................................................. -- risk: 326.277 [ 465].................................................. -- risk: 317.6053 [ 517].................................................. -- risk: 309.9062 [ 569].................................................. -- risk: 302.9771 [ 621].................................................. -- risk: 296.717 [ 673].................................................. -- risk: 290.9664 [ 725].................................................. -- risk: 285.683 [ 777].................................................. -- risk: 280.8266 [ 829].................................................. -- risk: 276.3009 [ 881].................................................. -- risk: 272.0859 [ 933].................................................. -- risk: 268.1369 [ 985].............. Final risk: 266.9768 22

> names(coef(gam2)) ## displays the selected base-learners, now at iteration 1000 [1] "bbs(age, df = dfbase)" "bbs(waistcirc, df = dfbase)" "bbs(hipcirc, df = dfbase)" [4] "bbs(elbowbreadth, df = dfbase)" "bbs(kneebreadth, df = dfbase)" "bbs(anthro3a, df = dfbase)" [7] "bbs(anthro3b, df = dfbase)" "bbs(anthro3c, df = dfbase)" "bbs(anthro4, df = dfbase)” > glm3 <- glmboost(DEXfat ~ hipcirc + kneebreadth + anthro3a, data = bodyfat,family = QuantReg(tau = 0.5), control = boost_control(mstop = 500)) > coef(glm3, off2int = TRUE) (Intercept) hipcirc kneebreadth anthro3a -63.5164304 0.5331394 0.7699975 7.8350858 23

More local methods… 24

Why local? 25

Sparse? 26

Remember this one? 27 How would you apply local methods here?

SVM-type One-class-classification: this model tries to find the support of a distribution and thus allows for outlier/novelty detection; epsilon-regression: here, the data points lie in between the two borders of the margin which is maximized under suitable conditions to avoid outlier inclusion; nu-regression: with analogue modifications of the regression model as in the classification case. 28

Reminder SVM and margin 29

Loss functions… 30 classification outlier regression

Regression By using a different loss function called the ε- insensitive loss function ||y−f (x)||ε = max{0, ||y− f(x)|| − ε}, SVMs can also perform regression. This loss function ignores errors that are smaller than a certain threshold ε > 0 thus creating a tube around the true output. 31

Example lm v. svm 32

Again SVM in R E1071 - the svm() function in e1071 provides a rigid interface to libsvm along with visualization and parameter tuning methods. kernlab features a variety of kernel-based methods and includes a SVM method based on the optimizers used in libsvm and bsvm Package klaR includes an interface to SVMlight, a popular SVM implementation that additionally offers classification tools such as Regularized Discriminant Analysis. Svmpath – you get the idea… 34

Knn is local – right? nearest neighbors is a simple algorithm that stores all available cases and predict the numerical target based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition already in the beginning of 1970’s as a non-parametric technique. 35

Distance… A simple implementation of KNN regression is to calculate the average of the numerical target of the K nearest neighbors. Another approach uses an inverse distance weighted average of the K nearest neighbors. Choosing K! KNN regression uses the same distance functions as KNN classification. knn.reg and also in kknn http://cran.r- project.org/web/packages/kknn/kknn.pdfhttp://cran.r- project.org/web/packages/kknn/kknn.pdf 36

Classes of local regression Locally (weighted) scatterplot smoothing –LOESS –LOWESS Fitting is done locally - the fit at point x, the fit is made using points in a neighborhood of x, weighted by their distance from x (with differences in ‘parametric’ variables being ignored when computing the distance) 37

Classes of local regression The size of the neighborhood is controlled by α (set by span). For α 1, all points are used, with the ‘maximum distance’ assumed to be α^(1/p) times the actual maximum distance for p explanatory variables. 39

Classes of local regression For the default family, fitting is by (weighted) least squares. For family="symmetric" a few iterations of an M-estimation procedure with Tukey's biweight are used. Be aware that as the initial value is the least- squares fit, this need not be a very resistant fit. It can be important to tune the control list to achieve acceptable speed. 40

Friedman (supsmu in modreg) is a running lines smoother which chooses between three spans for the lines. The running lines smoothers are symmetric, with k/2 data points each side of the predicted point, and values of k as 0.5 * n, 0.2 * n and 0.05 * n, where n is the number of data points. If span is specified, a single smoother with span span * n is used. 41

Friedman The best of the three smoothers is chosen by cross-validation for each prediction. The best spans are then smoothed by a running lines smoother and the final prediction chosen by linear interpolation. “For small samples (n 0) should be used. Reasonable span values are 0.2 to 0.4.” 42

Local non-param lplm (in Rearrangement) Local nonparametric method, local linear regression estimator with box kernel (default), for conditional mean functions 43

Ridge regression Addresses ill-posed regression problems using filtering approaches (e.g. high-pass) Often called “regularization” lm.ridge (in MASS) 44

Quantile regression –is desired if conditional quantile functions are of interest. One advantage of quantile regression, relative to the ordinary least squares regression, is that the quantile regression estimates are more robust against outliers in the response measurements. –In practice we often prefer using different measures of central tendency and statistical dispersion to obtain a more comprehensive analysis of the relationship between variables. 45

More… Partial Least Squares Regression (PLSR) mvr (in pls) Principal Component Regression (PCR) Canonical Powered Partial Least Squares (CPPLS) 46

PCR creates components to explain the observed variability in the predictor variables, without considering the response variable at all. On the other hand, PLSR does take the response variable into account, and therefore often leads to models that are able to fit the response variable with fewer components. Whether or not that ultimately translates into a better model, in terms of its practical use, depends on the context. 47

Splines smooth.spline, splinefun (stats, modreg) and ns (in splines) –http://www.inside-r.org/r-doc/splineshttp://www.inside-r.org/r-doc/splines a numeric function that is piecewise-defined by polynomial functions, and which possesses a sufficiently high degree of smoothness at the places where the polynomial pieces connect (which are known as knots) 48

Splines For interpolation, splines are often preferred to polynomial interpolation - they yields similar results to interpolating with higher degree polynomials while avoiding instability due to overfitting Features: simplicity of their construction, their ease and accuracy of evaluation, and their capacity to approximate complex shapes Most common: cubic spline, i.e., of order 3— in particular, cubic B-spline 49

cars 50

Smoothing/ local … https://web.njit.edu/all_topics/Prog_Lang_Doc s/html/library/modreg/html/00Index.htmlhttps://web.njit.edu/all_topics/Prog_Lang_Doc s/html/library/modreg/html/00Index.html http://cran.r-project.org/doc/contrib/Ricci- refcard-regression.pdfhttp://cran.r-project.org/doc/contrib/Ricci- refcard-regression.pdf 51

Lab on Friday… And reminder – Assignment 7 due on Apr. 29 – Friday 5pm Next week – mixed models! i.e. optimizing… Open lab on Fri. 29 th (no new work)…. 52

1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 12a, April 19, 2016 Cross-validation, Revisiting Regression – local models, and non-parametric…

Similar presentations

Presentation on theme: "1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 12a, April 19, 2016 Cross-validation, Revisiting Regression – local models, and non-parametric…"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 12a, April 19, 2016 Cross-validation, Revisiting Regression – local models, and non-parametric…

Similar presentations

Presentation on theme: "1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 12a, April 19, 2016 Cross-validation, Revisiting Regression – local models, and non-parametric…"— Presentation transcript:

Similar presentations

About project

Feedback