Chapter 4.2 Regression Topics


1 Chapter 4.2 Regression Topics
Credits: Hastie, Tibshirani, and Friedman (Chapter 3); Padhraic Smyth lecture notes; Wolfgang Jank lecture notes.

2 Regression Review
Linear regression models a numeric outcome as a linear function of several predictors. It is the king of all statistical and data mining models:
- ease of interpretation
- mathematically concise
- tends to perform well for prediction, even under violations of assumptions
Characteristics:
- numeric response, ideally real valued
- numeric predictors, though not necessarily

3 Linear Regression Model
Basic model: y = a0 + a1 x1 + ... + ap xp + e. You are not modeling y itself; you are modeling the mean of y for a given x!
- Simple regression: one x. Easy to describe and good for the mathematics, but not used often in data mining.
- Multiple regression: many x's. The response surface is a plane (or hyperplane), which is harder to conceptualize.
- Useful as a baseline model.

4 Linear Regression Model
Assumptions:
- linearity
- constant variance
- normality of errors: residuals ~ Normal(mu, sigma^2)
Assumptions must be checked, but if inference is not the goal, you can accept some deviation from them (don't tell the statisticians I said that!). Multicollinearity is also an issue: it creates unstable estimates.

5 Fitting the Model
We can look at regression as a matrix problem. We want a score function, in the coefficient vector a, to minimize:
S(a) = (y - Xa)'(y - Xa)
which is minimized by
a = (X'X)^{-1} X'y

6 Fitting models: in-sample
Minimize the sum of the squared errors:
S = sum(e^2) = e'e = (y - Xa)'(y - Xa)
  = y'y - a'X'y - y'Xa + a'X'Xa
  = y'y - 2a'X'y + a'X'Xa
Take the derivative of S with respect to a:
dS/da = -2X'y + 2X'Xa
Set this to 0 to find the minimum of S as a function of a:
-2X'y + 2X'Xa = 0
X'Xa = X'y
a = (X'X)^{-1} X'y
Prediction follows easily: y_hat = Xa = X(X'X)^{-1} X'y.
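The algebra above maps directly to a few lines of R. A minimal sketch on simulated data (the coefficients 2, 3, -1 are invented for illustration), checked against R's built-in lm():

    # Sketch: solving the normal equations X'X a = X'y on simulated data.
    set.seed(1)
    n <- 100
    X <- cbind(1, runif(n), runif(n))      # design matrix with intercept column
    y <- drop(X %*% c(2, 3, -1) + rnorm(n))  # invented true coefficients plus noise
    a <- solve(t(X) %*% X, t(X) %*% y)     # a = (X'X)^{-1} X'y, without forming the inverse
    y_hat <- X %*% a                       # prediction follows easily
    coef(lm(y ~ X - 1))                    # matches R's built-in least squares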

7 Fitting regression: out-of-sample
We could instead optimize "a" based on a hold-out sample and a search over all "a"s. But how do we search over all values of all the a's? This minimizes out-of-sample MSE and might give a different answer (recall MSE = Bias^2 + Variance). Because of the nice algebraic form, the in-sample solution is typically used, but a different loss function may change things.
R^2 measures the ratio between the regression sum of squares (how much of the variance the regression explains) and the total sum of squares (how much variation there is altogether). If it is close to 1, your fit is good. But be careful.
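As an illustration of the hold-out idea (not from the slides), here is a sketch that fits on a random subset of R's built-in mtcars data and scores on the rest; the model mpg ~ wt + hp is arbitrary:

    # Sketch: estimate out-of-sample MSE with a hold-out split.
    set.seed(2)
    idx  <- sample(nrow(mtcars), 22)                  # roughly 2/3 train, 1/3 test
    fit  <- lm(mpg ~ wt + hp, data = mtcars[idx, ])
    pred <- predict(fit, newdata = mtcars[-idx, ])
    mean((mtcars$mpg[-idx] - pred)^2)                 # out-of-sample MSE
    summary(fit)$r.squared                            # in-sample R^2, for contrast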

8 Limitations of Linear Regression
True relationship of X and Y might be non-linear
- suggests generalizations to non-linear models
Correlation/collinearity among the X variables
- can cause numerical instability
- problems in interpretability (identifiability)
Includes all variables in the model...
- but what if p = 100 and only 3 variables are related to Y?

9 Checking Assumptions
- Linearity: look to see whether transformations make relationships 'more' linear
- Normality of errors: histograms and QQ plots
- Non-constant variance: beware of 'fanning' residuals
- Time effects: can be revealed in an ordering plot
- Influence: use the hat matrix

10 Checking Influence
Since y_hat = Xa = X(X'X)^{-1}X'y = Hy, the matrix H = X(X'X)^{-1}X' is called the hat matrix (it puts the 'hat' on y). The diagonal element of H for a given observation is its influence: the leverage h_i quantifies the influence that the observed response y_i has on its own predicted value y_hat_i, and it measures the distance between the X values for the ith case and the means of the X values for all n cases. Each leverage h_i is a number between 0 and 1 inclusive.
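A hedged sketch of computing leverages in R, using an arbitrary model on the built-in mtcars data; hatvalues() returns the diagonal of H, and the direct computation confirms it:

    # Sketch: leverage values from the hat matrix H = X (X'X)^{-1} X'.
    fit <- lm(mpg ~ wt + hp, data = mtcars)    # arbitrary illustrative model
    h   <- hatvalues(fit)                      # diagonal of H, one leverage per observation
    X   <- model.matrix(fit)
    H   <- X %*% solve(t(X) %*% X) %*% t(X)    # the hat matrix, computed directly
    all.equal(unname(h), unname(diag(H)))      # TRUE: the two computations agree
    h[h > 2 * mean(h)]                         # a common rule of thumb for high leverage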

11 Influence Measures for Linear Model
There are a few quite influential (and extreme) points... What to do?

12 Diagnostic Plots
(Figure: residual diagnostic plots for a fitted linear model.)
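R produces a standard set of diagnostic plots for any lm fit; a minimal sketch, with an arbitrary illustrative model:

    # Sketch: R's standard diagnostic plots for a fitted linear model.
    fit <- lm(mpg ~ wt + hp, data = mtcars)
    par(mfrow = c(2, 2))   # 2x2 grid of panels
    plot(fit)  # residuals vs fitted, normal QQ, scale-location, residuals vs leverage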


14 Model selection: finding the best k variables
If noisy variables are included in the model, they can affect the overall performance. Best to remove any predictors which have no effect, lest random patterns look significant.
Searching all possible models: how many are there? (2^p subsets of p candidate variables.) Heuristic search is used to search over the model space:
- forward or backward stepwise search
- leaps-and-bounds techniques do exhaustive search
Scoring: in-sample, penalize for complexity (AIC, BIC, Mallows' Cp); out-of-sample, use cross-validation.

15 R 'step': uses AIC
(Figure: output of R's step() function, which performs a stepwise search scored by AIC.)
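A minimal sketch of stepwise selection with step(), using an arbitrary model on the built-in mtcars data:

    # Sketch: backward stepwise search scored by AIC.
    full <- lm(mpg ~ ., data = mtcars)           # start from all predictors
    sel  <- step(full, direction = "backward")   # drops terms while AIC improves
    summary(sel)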

16 Leaps Output
R 'leaps': uses Cp. (Figure: leaps output, candidate models ranked by Mallows' Cp.)
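A sketch of best-subset search using the leaps package (assumed installed); regsubsets() does the branch-and-bound search and its plot method can rank models by Cp:

    # Sketch: exhaustive best-subset search via the leaps package.
    library(leaps)
    subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)  # branch-and-bound search
    plot(subsets, scale = "Cp")   # candidate models ranked by Mallows' Cp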

17 Generalizing Linear Regression

18 Complexity versus Goodness of Fit
Training data. (Figure: scatterplot of y vs. x.)

19 Complexity versus Goodness of Fit
Training data. Too simple? (Figure: two y vs. x panels.)

20 Complexity versus Goodness of Fit
Training data. Too simple? Too complex? (Figure: three y vs. x panels.)

21 Complexity versus Goodness of Fit
Training data. Too simple? Too complex? About right? (Figure: four y vs. x panels.)

22 Complexity and Generalization
Score function (e.g., squared error) plotted against model complexity, where complexity = degrees of freedom in the model (e.g., number of variables). (Figure: S_train(θ) keeps decreasing as complexity grows, while S_test(θ) eventually turns back up; the optimal model complexity is where S_test(θ) is lowest.)

23 Non-linear models, linear in parameters
We can add additional polynomial terms to our equations: a non-linear functional form, but linear in the parameters (so still referred to as "linear regression"), e.g. y = a0 + a1 x1 + a2 x2 + a3 x1 x2. We can just treat the xi xj terms as additional fixed inputs. In fact we can add in any non-linear input functions, e.g. y = a0 + sum_k ak fk(x). (A sketch in R follows below.)
Comments:
- the number of parameters can explode => greater chance of overfitting
- adding complexity: must use penalties!
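A minimal sketch of the idea on invented data: the model is a cubic polynomial in x, yet it is fit by ordinary linear regression because it is linear in the coefficients:

    # Sketch: non-linear inputs, linear parameters -- still lm().
    x   <- seq(-3, 3, length.out = 100)
    y   <- sin(x) + rnorm(100, sd = 0.2)   # invented data
    fit <- lm(y ~ poly(x, 3))              # cubic in x, linear in the coefficients
    plot(x, y)
    lines(x, fitted(fit), col = "red")     # the fitted curve
    # Interactions work the same way, e.g. lm(y ~ x1 * x2) adds an x1:x2 input.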

24 Non-linear (both model and parameters)
We can generalize further to models that are non-linear in all aspects, e.g. y = a0 + sum_k ak gk(x; bk), where the g's are non-linear functions (k of them). This is called a neural network (we'll talk about it later). Closed-form (analytical) solutions are rare: this is a multivariate non-linear optimization problem (which may be quite difficult!).

25 Generalizing Regression
Generalized Linear Models (GLMs) consist of:
- a linear combination of the predictors
- an independent random variable for the response, with a distribution tied to the error term
- a link function which connects the two
GLMs are defined by:
- error structure (Gaussian, Poisson, Binomial)
- linear predictor (single variables, interactions, polynomials)
- link function (identity, log, reciprocal)

26 Logistic Regression
Logistic regression is the most common GLM. The response in this case is binary (0,1): Y follows a Bernoulli or Binomial distribution. We model the probability p of a 1 occurring. For mathematical convenience, we model the odds p/(1-p); the log odds are even better, since the logit function log(p/(1-p)) scales over the whole real line rather than [0,1]. Deviance: -2 x (difference in log-likelihood from the saturated model).

27 Logistic Regression (cont.)
Interpretation of the coefficients changes! Coefficients are on the log-odds scale: a unit increase in a predictor multiplies the odds of a 1 by exp(coefficient).

28 Logistic Example
womensrole data (R handbook). Survey in 1975: "Women should take care of running their homes and leave running the country up to men." For each education level and sex, the data record the counts of respondents who agree and disagree (columns: education, sex, agree, disagree; the table of counts is not reproduced here).
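A sketch of the logistic fit, assuming the womensrole data frame is available from the HSAUR package (the R handbook the slide cites); the two count columns form a binomial response:

    # Sketch: logistic regression on the womensrole data (assumed in HSAUR).
    data("womensrole", package = "HSAUR")
    fit <- glm(cbind(agree, disagree) ~ sex + education,
               family = binomial(link = "logit"), data = womensrole)
    summary(fit)    # coefficients are on the log-odds (logit) scale
    exp(coef(fit))  # exponentiating gives odds ratios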

29 Womensrole Logistic fit

30 Other GLMs
Another useful GLM is for count data: model Y ~ Poisson(lambda), with a log link, log(E[Y]) = linear predictor. Also called 'log-linear' models. Typically used for counts:
- people at a store
- calls at a help center
- spams in an hour
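A minimal sketch of a log-linear model on invented hourly call counts:

    # Sketch: a log-linear (Poisson) GLM for counts.
    set.seed(3)
    hour  <- 0:23
    calls <- rpois(24, lambda = exp(1 + 0.1 * hour))  # invented counts growing with hour
    fit   <- glm(calls ~ hour, family = poisson(link = "log"))
    exp(coef(fit))["hour"]  # multiplicative change in the expected count per hour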

31 Shrinkage Models: Ridge Regression
Variable selection is a binary process: each variable is either in or out. That makes it high variance: small changes in the data can affect the final model. Can we have a more continuous process, where each variable is 'partly' included? Ridge regression "shrinks" the coefficients by imposing a penalty on the model "size". Minimize the penalized sum of squares:
sum_i (y_i - x_i'a)^2 + lambda * sum_j a_j^2
lambda is a complexity parameter which controls the amount of shrinkage: the larger lambda is, the more the coefficients are shrunk towards 0.
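A sketch of ridge regression using MASS::lm.ridge (MASS ships with R); the variables and the lambda grid are chosen arbitrarily for illustration:

    # Sketch: ridge regression; inputs are standardized internally by lm.ridge.
    library(MASS)
    fit <- lm.ridge(mpg ~ ., data = mtcars, lambda = seq(0, 20, by = 0.5))
    plot(fit)     # coefficient profiles shrinking as lambda grows
    select(fit)   # lambda suggested by GCV and related criteria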

32 Ridge Regression
The model imposes a penalty on the coefficient size. Since the a's depend on the units, care must be taken to standardize the inputs. Also, you can show that the ridge estimates are a linear function of y:
a_ridge = (X'X + lambda*I)^{-1} X'y
This adds a positive constant to the diagonal of X'X and allows inversion even if the matrix is not full rank, so ridge can be used in cases where p > n! In general: increasing bias, decreasing variance. Often decreases MSE.

33 Ridge Coefficients
The effective degrees of freedom df(lambda) is a one-to-one monotone function of lambda such that df(lambda) ranges from 0 to p:
- lambda = 0, df = p: the least-squares solution; p degrees of freedom
- lambda = infinity, df = 0: heaviest shrinkage; all parameter estimates = 0; zero degrees of freedom
Look at the coefficient plot as a function of the degrees of freedom df(lambda).

34 Lasso
Very similar to ridge, with one important difference: the L2 penalty sum_j a_j^2 is replaced by the L1 penalty sum_j |a_j|. This has an interesting effect on the profile plot: if lambda is large enough, individual estimates go exactly to zero, giving continuous variable selection. In the standardized plot, s = 1 is the least-squares answer and s = 0 sets all estimates to 0; s = 0.5 was the value chosen by cross-validation in the example shown.

35 Lasso Coefficients
Note how parameters shrink exactly to zero! This is the appeal of the lasso (in addition to good performance). (Figure: lasso coefficient profiles plotted against s = df(lambda)/p.)

36 Principal Components Regression
Create principal components (PCs) from the original data vectors and use them in any of the above regression schemes. This removes the 'less important' parts of the data space while creating a reduced data set. Since each PC is a linear combination of the original variables, we can express the solution in terms of coefficients on the original variables.
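A sketch of principal components regression done by hand in R: principal components from prcomp(), then an ordinary lm() on the first k of them (k = 3 is arbitrary):

    # Sketch: PCR by hand -- PCs of the standardized predictors, then lm().
    pcs <- prcomp(mtcars[, -1], scale. = TRUE)
    k   <- 3                                   # keep the first k components
    fit <- lm(mtcars$mpg ~ pcs$x[, 1:k])
    # Each PC is a linear combination of the original variables, so the
    # solution maps back to coefficients on the (scaled) original variables:
    pcs$rotation[, 1:k] %*% coef(fit)[-1]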

37 Comparison of results (prostate data)
Coefficient estimates and test error for the prostate data (blank cells are variables dropped or zeroed by that method; the ridge test error, lost to a duplicate in transcription, is filled in from Hastie et al., Table 3.3):

Term        LS      Best Subset  Ridge   Lasso   PCR
Intercept   2.465   2.477        2.452   2.468   2.497
lcavol      0.680   0.740        0.420   0.533   0.543
lweight     0.236   0.316        0.238   0.169   0.289
age        -0.141               -0.046          -0.152
lbph        0.210                0.162   0.002   0.214
svi         0.305                0.227   0.094   0.315
lcp        -0.288                0.000          -0.051
gleason    -0.021                0.040           0.232
pgg45       0.267                0.133          -0.056
Test Error  0.521   0.492        0.492   0.479   0.449
Std Error   0.179   0.143        0.165   0.164   0.105

Cross-validation makes all of these different methods comparable to each other.

38 Nonparametric Modeling
A nonparametric model does not assume a fixed set of parameters to be estimated (thus the name nonparametric). Its general form is Y = f(X) + ε. Typically, we only assume that f() is some smooth, continuous function. Also, we typically assume independent and identically distributed errors, ε ~ N(0, σ^2), but that's not necessary. One-dimensional nonparametric regression is closely related to density estimation.

39 Advantages & Disadvantages
Advantages:
- more flexibility leads to a better fit to the data, and often also to better predictive capabilities
- smoothness can also lead to entirely new concepts, such as dynamics (via derivatives) and thus to flexible differential-equation models, etc.
Disadvantages:
- much more complexity; hard to explain

40 Fitting Nonparametric models
How do we estimate the function f()? Restrictions on f: smoothness, continuity, existence of the first and second derivatives. Options for estimating f include scatterplot smoothers, regression splines, smoothing splines, B-splines, thin-plate splines, wavelets, and many, many more... Here we focus on one particularly popular option, the smoothing spline.

41 Splines
Splines are piecewise polynomials smoothly connected together; the joining points of the polynomial pieces are called knots. Smoothing splines are splines that are penalized against too much local variability (and thus appear smoother). The pieces must join with the right degree of smoothness at the knots:
- linear spline: 0-times differentiable (continuous) at the knots
- cubic spline: twice differentiable at the knots
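A minimal sketch of a cubic smoothing spline in base R, on invented data; smooth.spline() can pick the smoothing parameter by cross-validation:

    # Sketch: a cubic smoothing spline fit with base R.
    set.seed(4)
    x   <- seq(0, 10, length.out = 200)
    y   <- sin(x) + rnorm(200, sd = 0.3)    # invented data
    fit <- smooth.spline(x, y, cv = TRUE)   # penalty chosen by cross-validation
    plot(x, y)
    lines(fit, col = "red")                 # the fitted spline
    fit$lambda                              # the selected smoothing parameter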

42 Piecewise Polynomial cont.
(Figure: piecewise-constant and piecewise-linear fits, with the knots marked.)

43 Spline cont. (Linear Spline)

44 Spline cont. (Cubic Spline)

45 Definition of Smoothing Splines
Smoothing splines arise as the solution to the following simple regression problem: find a piecewise polynomial f(x), with smooth breakpoints, that minimizes the penalized sum of squares
sum_i (y_i - f(x_i))^2 + lambda * integral (f''(x))^2 dx
where the first term measures fit and the second measures curvature. Between the knots the polynomials have degree 2n-1, where n is the order of the derivative being penalized (n = 2 above, giving the cubic smoothing spline).

46 Example of Smoothing Splines
(Figure: two smoothing splines fit to the Prestige data: little smoothing, λ small (red line); heavy smoothing, λ large (blue line).)

47 The smoothing parameter
The magnitude of λ affects the quality of the smoother; there are many ad hoc approaches to finding a "good" smoothing parameter:
- visual trial and error
- minimize the mean-squared error of the fit
- cross-validation, optimization on a hold-out sample, etc.

48 Prestige Data Revisited
Education (X1) and income (X2) influence the perceived prestige (Y) of a profession. Is there a linear relationship between the X's and Y? If we're not sure of the type of relationship between X and Y, nonparametric regression can be a very useful exploratory tool; a sketch of such a fit follows below.
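A hedged sketch of such an exploratory fit, assuming the Prestige data from the carData package and the mgcv package for additive models (the output on the next slide has this form):

    # Sketch: an additive model with smooth terms for each predictor.
    library(mgcv)
    data("Prestige", package = "carData")
    fit <- gam(prestige ~ s(income) + s(education), data = Prestige)
    summary(fit)          # edf, smooth-term tests, adjusted R-sq., GCV score
    plot(fit, pages = 1)  # estimated smooth functions for each predictor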

49 Additive Model Estimates
(Additive model output; the numeric estimates were lost in transcription.)
Parametric coefficients (estimate, std. err., t ratio, Pr(>|t|)): the constant (p < 2e-16) is the intercept.
Approximate significance of smooth terms (edf, chi.sq, p-value): s(income) (p on the order of 1e-10) and s(education) (p < 2e-16) give inference for income and education, similar to an F-test.
Measures of model fit: R-sq.(adj), deviance explained = 84.7%, GCV score.

50 Compare to Classical Regression
(Classical linear model output; the numeric estimates were lost in transcription.)
Parametric coefficients (estimate, std. err., t ratio, Pr(>|t|)): (Intercept), income (p on the order of 1e-08), education (p < 2e-16).
Measures of model fit: R-sq.(adj), deviance explained = 79.8%, GCV score.
Better model fit for the nonparametric model (84.7% vs. 79.8% deviance explained)!

51 Function Estimates from Additive Regression Model
What is the nature of the relationship between the individual predictor variables and prestige? (Figure: estimated smooth functions of income and education from the additive model.)

