Chapter 4.2: Regression Topics

Credits: Hastie, Tibshirani, Friedman (Chapter 3); Padhraic Smyth lecture notes; Wolfgang Jank lecture notes
Regression Review

Linear regression models a numeric outcome as a linear function of several predictors. It is the king of all statistical and data mining models:
–ease of interpretation
–mathematically concise
–tends to perform well for prediction, even under violations of its assumptions

Characteristics:
–numeric response, ideally real-valued
–numeric predictors, but not necessarily
Linear Regression Model

Basic model: you are not modeling y itself, but the mean of y for a given x!

Simple regression, one x:
–easy to describe and good for the mathematics, but not used often in data mining

Multiple regression, many x's:
–the response surface is a plane… harder to conceptualize

Useful as a baseline model
Linear Regression Model

Assumptions:
–linearity
–constant variance
–normality of errors: residuals ~ Normal(mu, sigma^2)

Assumptions must be checked,
–but if inference is not the goal, you can accept some deviation from the assumptions (don't tell the statisticians I said that!)

Multicollinearity is also an issue:
–it creates unstable estimates
Fitting the Model

We can look at regression as a matrix problem: y = Xa + e. We want the coefficient vector a which minimizes the score function

S(a) = (y - Xa)'(y - Xa)

which is minimized by

a = (X'X)^(-1) X'y
Fitting models: in-sample

Minimize the sum of the squared errors:

S = Σ e_i² = e'e = (y - Xa)'(y - Xa)
  = y'y - a'X'y - y'Xa + a'X'Xa
  = y'y - 2a'X'y + a'X'Xa

Take the derivative of S with respect to a:

dS/da = -2X'y + 2X'Xa

Set this to 0 to find the minimum of S as a function of a:

-2X'y + 2X'Xa = 0
X'Xa = X'y
a = (X'X)^(-1) X'y

Prediction follows easily: ŷ = Xa = X(X'X)^(-1) X'y
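A minimal sketch in R of this closed-form solution, checked against lm(); the data and variable names are simulated purely for illustration:

# Sketch: OLS via the normal equations, compared with lm().
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))      # design matrix with intercept column
a <- c(2, 0.5, -1)                     # true coefficients
y <- X %*% a + rnorm(n)

a_hat <- solve(t(X) %*% X, t(X) %*% y) # a = (X'X)^{-1} X'y
y_hat <- X %*% a_hat                   # predictions: y_hat = X a_hat

coef(lm(y ~ X[, 2] + X[, 3]))          # lm() gives the same estimates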
Fitting regression: out-of-sample

Can also optimize a based on a hold-out sample and a search over all a's
–But how do we search over all values of all the a's?
–This will minimize out-of-sample MSE, which might give a different answer
–MSE = Bias² + Variance

Because of the nice algebraic form, the in-sample solution is typically used
–But a different loss function may change things
–R² measures the ratio between the regression sum of squares (how much of the variance the regression explains) and the total sum of squares (how much variation there is altogether)
–If it is close to 1, your fit is good. But be careful.
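A sketch of hold-out scoring; the data frame and column names are simulated for illustration:

# Sketch: score a fitted model on a hold-out sample instead of in-sample.
set.seed(2)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(200)

train <- d[1:150, ]
test  <- d[151:200, ]
fit   <- lm(y ~ x1 + x2, data = train)

# out-of-sample MSE on the hold-out set
mean((test$y - predict(fit, newdata = test))^2)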
Limitations of Linear Regression

True relationship of X and Y might be non-linear
–Suggests generalizations to non-linear models

Correlation/collinearity among the X variables
–Can cause numerical instability
–Problems in interpretability (identifiability)

Includes all variables in the model…
–But what if p = 100 and only 3 variables are related to Y?
Checking Assumptions

linearity
–look to see if transformations make relationships 'more' linear

normality of errors
–histograms and qq-plots

non-constant variance
–beware of 'fanning' residuals

time effects
–can be revealed in an ordering plot

influence
–use the hat matrix
Checking Influence

H = X(X'X)^(-1) X' is called the hat matrix (why? because ŷ = Hy: it puts the hat on y)

The i-th diagonal element of H for a given observation is its influence. The leverage h_i quantifies the influence that the observed response y_i has on its predicted value ŷ_i.

It measures the distance between the X values for the i-th case and the means of the X values for all n cases.

The influence h_i is a number between 0 and 1 inclusive.
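A sketch of computing leverages two equivalent ways, using the built-in cars data for illustration:

# Leverages from the hat matrix, manually and via hatvalues().
fit <- lm(dist ~ speed, data = cars)
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)  # H = X (X'X)^{-1} X'
h_manual <- diag(H)                    # leverage h_i = H[i, i]
h_lm <- hatvalues(fit)                 # same thing from the fitted model
all.equal(unname(h_manual), unname(h_lm))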
Influence Measures for the Linear Model

(table of influence measures for the fitted model)

There are a few quite influential (and extreme) points… What to do?
Diagnostic Plots

(residual diagnostic plots for the fitted model)
Model Selection: finding the best k variables

If noisy variables are included in the model, they can affect the overall performance. It is best to remove any predictors which have no effect, lest random patterns look significant.

Searching all possible models
–How many are there? (2^p)
–Heuristic search is used to search over the model space:
  forward or backward stepwise search
  leaps-and-bounds techniques do exhaustive search
–In-sample: penalize for complexity (AIC, BIC, Mallows' Cp)
–Out-of-sample: use cross-validation
R 'step': uses AIC

(R output from a stepwise search shown here)
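A minimal sketch of running such a stepwise search, using the built-in swiss data for illustration; step() penalizes by AIC by default:

# Stepwise model search with step().
full <- lm(Fertility ~ ., data = swiss)
sel  <- step(full, direction = "both", trace = 0)
summary(sel)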
Leaps output

R 'leaps': uses Cp

(R output from an exhaustive leaps-and-bounds search shown here)
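A sketch using the leaps package (assumed installed) to do the exhaustive search and pick the Cp-minimizing subset, again on the swiss data:

# Exhaustive subset search scored by Mallows' Cp.
library(leaps)
subsets <- regsubsets(Fertility ~ ., data = swiss, nbest = 1)
s <- summary(subsets)
s$cp                        # Cp for the best model of each size
s$which[which.min(s$cp), ]  # variables in the Cp-minimizing model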
Generalizing Linear Regression
Complexity versus Goodness of Fit

(figures: the same training data fit by models that are too simple, too complex, and about right)
Complexity and Generalization

(figure: score function, e.g. squared error, versus model complexity; S_train keeps decreasing with complexity while S_test is minimized at an intermediate, optimal model complexity)

Complexity = degrees of freedom in the model (e.g., number of variables)
Non-linear models, linear in parameters

We can add polynomial and interaction terms to our equations: a non-linear functional form, but linear in the parameters (so still referred to as "linear regression"), e.g.

y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_1 x_2

–We can just treat the x_i x_j terms as additional fixed inputs
–In fact we can add in any non-linear input functions, e.g. y = a_0 + Σ_j a_j f_j(x)

Comments:
–The number of parameters can explode => greater chance of overfitting
–When adding complexity we must use penalties! (see the sketch below)
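A sketch of fitting non-linear input functions with ordinary lm(), on simulated data:

# Non-linear in x, still linear in the parameters, so lm() applies.
set.seed(3)
x <- runif(100, -2, 2)
y <- 1 + x - 0.5 * x^2 + rnorm(100, sd = 0.3)

fit_poly <- lm(y ~ poly(x, 2))           # quadratic terms as extra inputs
fit_any  <- lm(y ~ x + I(x^2) + sin(x))  # any fixed non-linear inputs work
AIC(fit_poly, fit_any)                   # complexity-penalized comparison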
Non-linear (both model and parameters)

We can generalize further to models that are non-linear in all aspects:

y = a_0 + Σ_k a_k g_k(b_k' x)

where the g's are non-linear functions (k of them). This is called a neural network (we'll talk about it later).

Closed-form (analytical) solutions are rare. This is a multivariate non-linear optimization problem (which may be quite difficult!)
Generalizing Regression: Generalized Linear Models (GLM)

g(E[Y]) = Xa, with three pieces:
–Y: an independent random variable with a distribution based on the error term
–Xa: a linear combination of the predictors
–g: a function which connects the two

GLMs are defined by
–error structure (Gaussian, Poisson, Binomial)
–linear predictor (single variables, interactions, polynomials)
–link function (identity, log, reciprocal)
Logistic Regression

Logistic regression is the most common GLM.
–The response in this case is binary (0, 1): Y follows a Bernoulli or Binomial distribution
–We model p, the probability of a 1 occurring
–For mathematical convenience, we model the odds: p/(1-p)
–Log odds are even better: the logit function, log(p/(1-p))
–It scales over the whole real line, rather than [0, 1]

Deviance: -2 x (difference in log-likelihood from the saturated model)
Logistic Regression

Interpretation of the coefficients changes! Coefficients act on the log-odds scale, so exp(a_j) is the multiplicative change in the odds for a one-unit change in x_j.
Logistic example

womensrole data (R handbook)
–Survey in 1975: "Women should take care of running their homes and leave running the country up to men"
–Columns: education (years), sex, agree (count), disagree (count); one row per education level for each sex (the numeric counts shown in the original table are omitted here)
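A sketch of the logistic fit for these data; the womensrole data frame ships with the HSAUR package (the "R handbook" companion, assumed installed), with the column names shown above:

# Binomial GLM on grouped (agree, disagree) counts.
library(HSAUR)
data("womensrole", package = "HSAUR")
fit <- glm(cbind(agree, disagree) ~ sex + education,
           family = binomial(), data = womensrole)
summary(fit)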
Womensrole Logistic fit

(figure: the fitted logistic model for the womensrole data)
Other GLMs

Another useful GLM is for count data
–model Y ~ Poisson(lambda)
–the link is the log: log(E[Y]) is linear in the predictors
–also called 'log-linear' models
–typically used for counts:
  people at a store
  calls at a help center
  spam messages in an hour
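A minimal sketch of a log-linear (Poisson) GLM on simulated count data:

# Poisson regression; the log link is glm's default for family = poisson.
set.seed(4)
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.5 + 0.8 * x))  # true log link
fit <- glm(y ~ x, family = poisson())
summary(fit)
exp(coef(fit))  # multiplicative effects on the mean count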
Shrinkage Models: Ridge Regression

Variable selection is a binary process
–That makes it high variance: small changes in the data can affect the final model
–Can we have a more continuous process, where each variable is 'partly' included?

Ridge regression "shrinks" the coefficients by imposing a penalty for the model "size". Minimize the penalized sum of squares:

S(a) = Σ_i (y_i - x_i'a)² + λ Σ_j a_j²

λ is a complexity parameter which controls the amount of shrinkage: the larger λ is, the more the coefficients are shrunk towards 0.
Ridge Regression

The model imposes a penalty on the coefficient size. Since the a's depend on the units, care must be taken to standardize the inputs.

Also, you can show that the ridge estimates are a linear function of y:

a_ridge = (X'X + λI)^(-1) X'y

This adds a positive constant to the diagonal and allows inversion even if X'X is not full rank
–So ridge can be used in cases where p > n!

In general: increasing bias, decreasing variance
–Often decreases MSE
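A sketch of this closed form on simulated, standardized data:

# Ridge estimates from a = (X'X + lambda I)^{-1} X'y.
set.seed(5)
n <- 50; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))  # standardized predictors
y <- X[, 1] - X[, 2] + rnorm(n)
y <- y - mean(y)                        # center y; the intercept is not penalized

ridge <- function(lambda)
  drop(solve(t(X) %*% X + lambda * diag(p), t(X) %*% y))

cbind(ls = ridge(0), shrunk = ridge(10))  # larger lambda -> smaller coefficients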
Ridge coefficients

The effective degrees of freedom df(λ) is a one-to-one monotone function of λ such that df(λ) ranges from 0 to p:
–λ = 0, s = p: the least squares solution; p degrees of freedom
–λ = inf, s = 0: heaviest shrinkage; all parameter estimates are 0; zero degrees of freedom

Look at the coefficient profile plot as a function of degrees of freedom df(λ)
Lasso

Very similar to ridge with one important difference: the L2 penalty Σ a_j² is replaced by the L1 penalty Σ |a_j|.

This has an interesting effect on the profile plot:
–if lambda is large then estimates go to exactly zero
–continuous variable selection
–s = 1 is the least squares answer
–s = 0: all estimates are 0
–s = 0.5 was the value chosen by cross-validation
Lasso coefficients

(figure: lasso coefficient profiles versus s = df(λ)/p)

Note how parameters shrink to exactly zero! This is the appeal of the lasso (in addition to good performance)
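A sketch of the lasso path with the glmnet package (assumed installed; alpha = 1 gives the L1 penalty), on simulated data:

# Lasso coefficient path and cross-validated lambda.
library(glmnet)
set.seed(6)
X <- matrix(rnorm(100 * 8), 100, 8)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)

fit <- glmnet(X, y, alpha = 1)
plot(fit)                      # coefficients shrink to exactly zero
cvfit <- cv.glmnet(X, y)       # choose lambda by cross-validation
coef(cvfit, s = "lambda.min")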
Principal Components Regression

Create principal components from the original data vectors and use them in any of the above regression schemes.

This removes the 'less important' parts of the data space, while creating a reduced data set.

Since each PC is a linear combination of the original variables, we can express the solution in terms of coefficients on the original variables.
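A minimal sketch of PCR with base R on simulated data, including the mapping back to the original (scaled) variables:

# PCR: regress y on the first k principal components.
set.seed(7)
X <- matrix(rnorm(100 * 6), 100, 6)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(100)

pc <- prcomp(X, scale. = TRUE)
k  <- 3                                    # number of components kept
fit <- lm(y ~ pc$x[, 1:k])                 # regression on PC scores
b_pc <- coef(fit)[-1]
pc$rotation[, 1:k] %*% b_pc                # coefficients on the scaled originals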
Comparison of results (prostate data)

(table comparing coefficient estimates for the Intercept, lcavol, lweight, age, lbph, svi, lcp, gleason, and pgg45, plus Test Error and Std Error, across LS, Best Subset, Ridge, Lasso, and PCR; the numeric entries were lost in extraction)

Cross-validation allows all of these different methods to be compared to each other.
Nonparametric Modeling

A nonparametric model does not assume a fixed parametric form with parameters to be estimated (thus the name nonparametric)
–Its general form is Y = f(X) + ε
–Typically, we only assume that f() is some smooth, continuous function
–Also, we typically assume independent and identically distributed errors, ε ~ N(0, σ²), but that's not necessary
–1-D nonparametric regression = density estimation
Advantages & Disadvantages

Advantages
–More flexibility leads to a better fit to the data, and often also to better predictive capability
–Smoothness can also lead to entirely new concepts, such as dynamics (via derivatives) and thus to flexible differential equation models, etc.

Disadvantages
–Much more complexity; hard to explain
Fitting Nonparametric Models

How do we estimate the function f()?
–Restrictions on f: smoothness, continuity, existence of the first and second derivatives
–Options for estimating f include scatterplot smoothers, regression splines, smoothing splines, B-splines, thin-plate splines, wavelets, and many, many more…
–We focus on one particularly popular option: the smoothing spline
Splines

Splines are piecewise polynomials smoothly connected together. The joining points of the polynomial pieces are called knots.

Smoothing splines are splines that are penalized against too much local variability (and thus appear smoother).
–Smoothness at the knots depends on the degree:
–linear spline: 0-times differentiable at the knots
–cubic spline: twice differentiable at the knots
Piecewise Polynomials and Splines

(figures: piecewise constant and piecewise linear fits with the knots marked; a linear spline; a cubic spline)
Definition of Smoothing Splines

Smoothing splines arise as the solution to the following simple regression problem:
–Find a piecewise polynomial f(x) with smooth breakpoints
–f(x) minimizes the penalized sum of squares

Σ_i (y_i - f(x_i))²  +  λ ∫ f''(x)² dx
      [fit]                [curvature]
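A sketch with base R's smooth.spline(), which minimizes exactly this penalized criterion, using the built-in cars data for illustration:

# Smoothing splines at different amounts of smoothing.
x <- cars$speed; y <- cars$dist
fit_auto   <- smooth.spline(x, y)              # lambda picked by cross-validation
fit_rough  <- smooth.spline(x, y, spar = 0.2)  # little smoothing
fit_smooth <- smooth.spline(x, y, spar = 1.0)  # heavy smoothing
plot(x, y)
lines(predict(fit_rough), col = "red")
lines(predict(fit_smooth), col = "blue")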
Example of Smoothing Splines

Two smoothing splines fit to the Prestige data (figure):
–Little smoothing, λ small (red line)
–Heavy smoothing, λ large (blue line)
The Smoothing Parameter

The magnitude of λ affects the quality of the smoother; there are many ad hoc approaches to finding a "good" smoothing parameter:
–Visual trial and error
–Minimize the mean squared error of the fit
–Cross-validation, optimization on a hold-out sample, etc.
Prestige Data Revisited

Education (X1) and Income (X2) influence the perceived Prestige (Y) of a profession.

Is there a linear relationship between the X's and Y? If we're not sure of the type of relationship between X and Y, nonparametric regression can be a very useful exploratory tool.
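A sketch of the additive model behind the output on the next slide; the Prestige data ships with the carData package and the smooth fit uses mgcv (both assumed installed):

# Additive model with smooth terms, plus the classical linear fit to compare.
library(mgcv)
data("Prestige", package = "carData")
fit_gam <- gam(prestige ~ s(income) + s(education), data = Prestige)
summary(fit_gam)

fit_lm <- lm(prestige ~ income + education, data = Prestige)
summary(fit_lm)
plot(fit_gam, pages = 1)  # estimated smooth functions of each predictor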
Additive Model Estimates

(gam summary output; numeric estimates lost in extraction)

Parametric coefficients (the intercept): estimate, std. err., t ratio, Pr(>|t|) < 2e-16

Approximate significance of smooth terms (inference for income and education, similar to an F-test):
  s(income):    edf, chi.sq, p-value on the order of 1e-10
  s(education): edf, chi.sq, p-value < 2e-16

Measures of model fit: R-sq.(adj); deviance explained = 84.7%; GCV score
Compare to Classical Regression

(lm summary output; numeric estimates lost in extraction)

Parametric coefficients: (Intercept); income, Pr(>|t|) on the order of 1e-08; education, Pr(>|t|) < 2e-16

Measures of model fit: R-sq.(adj); deviance explained = 79.8%; GCV score

Better model fit for the nonparametric model (84.7% vs 79.8% deviance explained)!
Function Estimates from the Additive Regression Model

(figure: estimated smooth functions of income and education)

What is the nature of the relationship between the individual predictor variables and prestige?