November 13, 2013: collect model fits for the 4 problems; return reports; VIFs; launch Chapter 11.

Grocery Data Assignment: X3 (holiday) and X1 (cases shipped) do the job and X2 adds nothing; normality and constant variance are okay; no need for quadratic terms or interactions; in particular, no need to square X3 (a two-level factor).

Some notes: State the conclusion (final model) up front. Report the model refit with X1 and X3 rather than the fit with X1, X2, and X3 with X2 simply dropped. Check assumptions for the X1, X3 model, not the model that also includes X2. Box-Cox suggests no transformation even though λ = 2 is "best." A bit on interactions.

Notes on writing: Avoid the imperative form of verbs ("Fit the multivariate." "Run the model." "Be good." The implied (You) Verb … construction). Don't use contractions; it's bad form, and they're considered informal. Spell check does not catch wrong words (e.g., "blow" instead of "below," "not" instead of "note"). Writing skills are important (the benefits are considerable).

Variance Inflation Factors, or VIFs (not BFFs): VIF_i = (1 - R_i^2)^(-1), where R_i^2 is the R-squared from regressing X_i on the other X's. Available in JMP if you know where to look. Body fat example.
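Not in the original slides: a minimal numpy sketch of this computation, assuming X is an n-by-p array whose columns are the predictors (the function name is illustrative, not JMP's).

    import numpy as np

    def vif(X):
        """VIF_k = 1 / (1 - R^2_k), where R^2_k comes from regressing
        column k of X on the remaining columns (with an intercept)."""
        X = np.asarray(X, dtype=float)
        n, p = X.shape
        out = []
        for k in range(p):
            y = X[:, k]
            A = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
            out.append(1.0 / (1.0 - r2))
        return out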

x3 appears to be both a “dud” for predicting y and not very collinear with either x1 or x2

In computing the VIFs we need to regress x3 on x1 and x2 and compute 1/(1 - R^2), which here is about 100. What gives? I thought x3 was pretty much unrelated to x1 and x2?

What we were “expecting”

Added-variable plot to the rescue. Note that 1.6084963 matches the earlier value. (Suggest you run it the other way as well.) Regress x3 on x1 and x2 on x1, and save the residuals.

X3 not so related to x1

X3 unrelated to x2

Show in 3-D to get the perspective

Interesting body fat example: pairwise it looks like very little collinearity, yet there is massive multicollinearity! This is why it can be challenging at times, and why we don't throw extra "dud" variables into the model.

Steps in the analysis: a multivariate look to get acquainted with the data (Analyze > Distribution for all variables); look for a decent, parsimonious model (linear terms, interactions, quadratics; stepwise if there are many variables; PRESS vs. root mean square error; added-variable plots according to taste); check assumptions (lots of plots); check for outliers and influential observations (hat values and Cook's D_i).

Chapter 11: Remedial Measures We’ll cover in some detail: 11.1 Weighted Least Squares 11.2 Ridge regression 11.4 Regression trees 11.5 Bootstrapping

11.1 Weighted Least Squares. Suppose the constant-variance assumption does not hold: each error has its own variance, Var(ε_i) = σ_i^2, but the covariances are still zero. Ordinary least squares is out; what should we do?

Use maximum likelihood for inspiration! Define the i-th weight to be w_i = 1/σ_i^2. Then the likelihood is a product of normal densities, each with its own variance σ_i^2.

Taking logarithms, the log likelihood is a constant plus a term in which each squared residual is weighted by w_i, so the criterion is the same as least squares except for the weights; hence the weighted least squares criterion Q_w. The coefficient vector b_w that minimizes Q_w is the vector of weighted least squares estimates.
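The formulas on this slide were figures in the original; one standard way to write the weighted criterion described above is

    Q_w = \sum_{i=1}^{n} w_i \left( Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1} \right)^2,
    \qquad w_i = \frac{1}{\sigma_i^2}.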

Matrix Approach to WLS: let W be the diagonal matrix of the weights w_i; then the criterion and its minimizer take the matrix forms below.
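The matrix formulas were figures in the original; written out in standard notation (an assumption, not copied from the slides):

    W = \operatorname{diag}(w_1, \ldots, w_n), \qquad
    Q_w = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})' \, W \, (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}), \qquad
    \mathbf{b}_w = (\mathbf{X}' W \mathbf{X})^{-1} \mathbf{X}' W \mathbf{Y}.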

By popular demand….lots of stuff cancels.

Usual Case: Variances Are Unknown. We need to estimate each variance! Recall: what statistic can estimate σ_i^2? What statistic can estimate σ_i? MSE… root MSE… PRESS?

Estimating a Standard Deviation Function. Step 1: Do ordinary least squares and obtain the residuals. Step 2: Regress the absolute values of the residuals against Y or whatever predictor(s) seem to be associated with changes in the variance of the residuals. Step 3: Use the predicted absolute residual for case i as the estimated standard deviation of ε_i; call it ŝ_i. Step 4: Then w_i = 1/ŝ_i^2.

Subset x and y for Table 11.1. Fit y on x and save the residuals; compute the absolute values of the residuals. Regress these absolute residuals on x; the predicted values are the estimated standard deviations. The weights are the reciprocals of the squared standard deviations. Use these weights in WLS on the original y and x variables to get ŷ = 55.566 + 0.5963x.
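Not part of the slides: a short numpy sketch of those steps, assuming x and y are 1-D arrays like the Table 11.1 data (no JMP involved; the function name is illustrative).

    import numpy as np

    def wls_with_estimated_weights(x, y):
        """Two-stage WLS: (1) OLS fit, (2) regress |residuals| on x to get an
        estimated standard deviation function, (3) weights = 1 / s_hat^2,
        (4) weighted least squares fit with those weights."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        X = np.column_stack([np.ones_like(x), x])
        b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)      # step 1: OLS
        abs_res = np.abs(y - X @ b_ols)
        c, *_ = np.linalg.lstsq(X, abs_res, rcond=None)    # step 2: |e| on x
        s_hat = X @ c                                      # estimated SDs
        w = 1.0 / s_hat**2                                 # step 3: weights
        W = np.diag(w)
        b_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # step 4: WLS fit
        return b_wls                                       # [intercept, slope]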

Pictures

Example

Notes on WLS Estimates. WLS estimates are minimum variance and unbiased. If you use ordinary least squares (OLS) when the variance is not constant, the estimates are still unbiased, just not minimum variance. If you have replicates at each unique X category, you can just use the sample standard deviation of the responses in each category to determine the weight for any response in that category. R^2 has no clear-cut meaning here. You must use the standard deviation function value (instead of s) for confidence intervals for prediction.

11.2 Ridge Regression. Biased regression to reduce the effect of multicollinearity. Shrinkage estimation: reduce the variance of the parameter estimates by shrinking them (a bit) in absolute magnitude. This introduces some bias but may reduce the MSE overall. Recall: MSE = bias squared plus variance. (Not in original pdf of slides.)

Not in orig. pdf of slides

How to shrink? Penalized least squares! Start with the standardized (correlation-transformed) regression model and add a "penalty" proportional to the total size of the parameters; the proportionality (biasing) constant is c. (Not in pdf of slides.)

Matrix Ridge Solution. Start with a small c and increase it (iteratively) until the coefficients stabilize; the plot of the coefficients against c is called the "ridge trace." Here, use c approximately equal to 0.02. (Not in pdf of slides.)
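The penalized criterion and its matrix solution were figures in the original; for the correlation-transformed model they are conventionally written as (standard form, not copied from the slides)

    Q_R = \sum_i \Big( Y_i^{*} - \sum_k \beta_k^{*} X_{ik}^{*} \Big)^2 + c \sum_k (\beta_k^{*})^2,
    \qquad
    \mathbf{b}^{R} = (\mathbf{r}_{XX} + c\,\mathbf{I})^{-1}\, \mathbf{r}_{YX},

where r_XX is the correlation matrix of the predictors and r_YX is the vector of correlations between the predictors and the response.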

Example Not in pdf of slides

Decision trees… Section 11.4. Kddnuggets.com suggests… A bit of history: Breiman et al. gave the method respectability. Oldie-but-goodie slides. SA Titanic data. Some bootstrapping stuff (probably not tonight).

Taxonomy of Methods

Data Mining and Predictive Modeling. Predictive modeling mainly involves the application of regression, logistic regression, regression trees, classification trees, and neural networks to very large data sets. The technical difference is that, because we have so much data, we can rely on validation techniques: the use of training, validation, and test sets to assess our models. There is much less concern about statistical significance (everything is significant!), outliers/influence (a few outliers have no effect), the meaning of coefficients (models may have thousands of predictors), and distributional assumptions, independence, etc. Recall the quiz problem with 387K observations: even dopey ZIP Code and AgeBuilt were significant.

Data Mining and Predictive Modeling. We will talk about some of the statistical techniques used in predictive modeling once the data have been gathered, cleaned, and organized. But data gathering usually involves merging disparate data from different sources, data warehouses, etc., and usually represents at least 80% of the work. General rule: your organization's data warehouse will not have the information you need to build your predictive model. (Paraphrased from Usama Fayyad, VP of data, Yahoo.)

Regression Trees. Idea: can we cut up the predictor space into rectangles such that the response is roughly constant in each rectangle, but the mean changes from rectangle to rectangle? We'll just use the sample average of Y in each rectangle as our predictor! A simple, easy-to-calculate, assumption-free, nonparametric regression method. Note there is no "equation"; the predictive model takes the form of a decision tree.

Steroid Data. See file ch11ta08steroidSplitTreeCalc.jmp. The overall average of y is 17.64; SSE is 1284.8.
Distinct ages: 8 9 10 11 12 13 14 16 17 18 19 21 23 24
Num in group:  2 3  1  1  2  2  2  1  3  2  1  2  2  1

Example: Steroid Data Predictive Model. The fitted tree predicts the region means 13.675, 16.95, 22.2, 3.55, and 8.133. Example: what is Ŷ at Age = 9.5?

How do we find the regions (i.e., grow the tree)? For one predictor X, it's easy. Step 1: To find the first split point X_s, make a grid of possible split points along the X axis. Each possible split point divides the X axis into two regions, R21 and R22. Now compute the SSE for the two-region regression tree, SSE(R21) + SSE(R22), where each region is predicted by its own mean. Do this for every grid point; the point that leads to the minimum SSE is the split point. Step 2: If you now have r regions, determine the best split point for each of the r regions as you did in Step 1, and choose the one that leads to the lowest SSE for the r + 1 regions. Step 3: Repeat Step 2 until the SSE levels off (more later on stopping).
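Not from the slides: a small Python sketch of the Step 1 search for a single predictor (assumes numeric arrays; the function name is illustrative).

    import numpy as np

    def best_first_split(x, y):
        """Scan candidate split points on a single predictor and return the
        (split point, SSE) pair minimizing the SSE of the two-region tree,
        where each region is predicted by its own mean."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        xs = np.unique(x)
        candidates = (xs[:-1] + xs[1:]) / 2      # midpoints between distinct x values
        best_point, best_sse = None, np.inf
        for s in candidates:
            left, right = y[x <= s], y[x > s]
            sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if sse < best_sse:
                best_point, best_sse = s, sse
        return best_point, best_sse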

Illustrate the first split with the Steroid Data. See file ch11ta08steroidSplitTreeCalc.jmp. The overall average of y is 17.64; SSE is 1284.8.
Distinct ages: 8 9 10 11 12 13 14 16 17 18 19 21 23 24
Num in group:  2 3  1  1  2  2  2  1  3  2  1  2  2  1

In the aforementioned JMP file: point out the calculations needed to determine the optimal first split (easy, but a bit tedious); binary vs. multiple splits; run it in JMP, and be sure to set the minimum number per split; fit a conventional model as well… Don't forget the Titanic data…

Growing the Steroid Level Tree: Split 1, Split 2, Split 3, Split 4.

When do we stop growing? If you let the growth process go on forever, you'll eventually have n regions, each with just one observation. The mean of each region is the value of that observation, and R^2 = 100%. (You fitted n means (parameters), so you have n - n = 0 degrees of freedom for error.) Where to stop? We decide by data splitting and cross-validation: after each split, use your model (tree) to predict each observation in a hold-out sample and compute MSPR or the hold-out R^2. As we saw with OLS regression, MSPR will start to increase (the hold-out R^2 will decrease) when we overfit. We can rely on this because we have very large sample sizes. (This comment assumes no replicates.)

What about multiple predictors? For two or more predictors, no problem. For each region, we have to determine the best predictor to split on AND the best split point for that predictor. So if we have p - 1 predictors and at stage r we have r regions, there are r(p - 1) region-predictor combinations to search. Example: three splits for two predictors.

GPA Data Results (text)

Using JMP for Regression Trees. Analyze >> Modeling >> Partition. Exclude at least 1/3 of the data as a validation sample using Rows >> Row Selection >> Select Randomly, then Rows >> Exclude. JMP will automatically give the predicted R^2 value (1 - SSE/SSTO for the validation set). You need to call for each split manually (it does not fit the tree automatically).

Split button. Note the R^2 for the hold-out sample: as you grow the tree, this value will peak and then begin to decline! Clicking the red triangle gives options; select "split history" to see a plot of the predicted R^2 vs. the number of splits.

Classification Trees. The regression tree equivalent of logistic regression. The response is binary 0-1; the average response in each region is now a proportion p, not a mean Ȳ. For each possible split point, instead of SSE we compute the G^2 (deviance) statistic for the resulting 2 by r contingency table, and the split goes to the smallest value. (One can also rank splits by a p-value adjusted in a Bonferroni-like manner; JMP reports this as the "LogWorth" statistic, the negative log10 of the adjusted p-value, so a small p-value, i.e., a large LogWorth, marks the better split.)

Understanding ROC and Lift Charts. Assessing the ability to classify a case (predict) correctly in logistic regression, classification trees, or neural networks (with binary responses) as a function of the cutoff value chosen. ROC curve: plot the true positive rate P(Ŷ = 1 | Y = 1) vs. the false positive rate P(Ŷ = 1 | Y = 0). Example 1: classify the top 40% (of predicted probabilities) as 1 and the bottom 60% as 0; here that is the same as a cutoff of .45.
Pred prob:       .49 .48 .47 .46 | .43 .41 .38 .36 .32 .29
Data:             1   1   1   0  |  0   1   1   0   1   0
Classification:   1   1   1   1  |  0   0   0   0   0   0
                 (top 40%)         (bottom 60%)

Calculating sensitivity (true positive rate) and 1 - specificity (false positive rate): true positive rate P(Ŷ = 1 | Y = 1) = 3/6 = .50 (y-axis value); false positive rate P(Ŷ = 1 | Y = 0) = 1/4 = .25 (x-axis value).

Example 2: classify the top 40% (of predicted probabilities) as 1 and the bottom 60% as 0; here that is the same as a cutoff of .45.
Pred prob:       .49 .48 .47 .46 | .43 .41 .38 .36 .32 .29
Data:             1   1   1   0  |  0   0   0   0   0   0
Classification:   1   1   1   1  |  0   0   0   0   0   0
                 (top 40%)         (bottom 60%)
True positive rate: P(Ŷ = 1 | Y = 1) = 3/3 = 1.0 (y-axis value). False positive rate: P(Ŷ = 1 | Y = 0) = 1/7 = .1428 (x-axis value).
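Not in the slides: a few lines of Python that reproduce the Example 1 arithmetic (the arrays are copied from the slide; variable names are illustrative).

    import numpy as np

    pred_prob = np.array([.49, .48, .47, .46, .43, .41, .38, .36, .32, .29])
    y         = np.array([  1,   1,   1,   0,   0,   1,   1,   0,   1,   0])  # Example 1 data

    cutoff = 0.45
    yhat = (pred_prob >= cutoff).astype(int)      # top 40% classified as 1

    tpr = ((yhat == 1) & (y == 1)).sum() / (y == 1).sum()   # 3/6 = 0.50
    fpr = ((yhat == 1) & (y == 0)).sum() / (y == 0).sum()   # 1/4 = 0.25
    print(tpr, fpr)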

11.5 Bootstrapping in Regression A method that uses computer simulation, rather than theory and analytical results, to obtain sampling distributions of statistics. From these we can estimate the precision of an estimator.

Background: Simulated Intervals Suppose: Our objective is to get a confidence interval for the slope in a simple linear regression setting. We know the distribution of Y at each X value. How can we use computer simulation (Minitab) to get a confidence interval for the slope?

Background: Simulated Intervals. Easy: obtain a random Y value for each of the n X points; compute the regression and store b1; do the above, say, 1,000,000 times; then make a histogram of the b1 values and use the .025 and .975 percentiles!

Simulated Intervals: Example. Toluca Company data: assume E(Y) = 62 + 3.6X and σ = 50; that is, Y ~ N(62 + 3.6X, σ = 50). Minitab exec (each run appends one simulated slope):
  let k1 = k1 + 1              # bump the run counter
  random 25 c3;                # 25 random errors
    normal 0 50.               #   ~ N(0, 50)
  let c4 = 62 + 3.6*'X' + c3   # simulated Y at the fixed X values
  regress c4 1 'X';            # refit the regression
    coefficients c5.
  let c6(k1) = c5(2)           # save this run's slope b1

What if we don't know the distribution of the errors? Answer: use the empirical distribution. Fit the model (assume it's true); then, for each simulation run, draw a random sample of n residuals (with replacement) from the n observed residuals, compute the new Y values, run the regression, and store the bootstrap slope value b1*. This is the basic approach of the fixed-X sampling bootstrap.
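Not from the slides: a minimal numpy sketch of the fixed-X (residual) bootstrap for the slope, under the usual simple-regression setup (names are illustrative).

    import numpy as np

    def fixed_x_bootstrap_slopes(x, y, B=1000, seed=None):
        """Fixed-X residual bootstrap: resample the observed residuals with
        replacement, form Y* = fitted + e*, refit, and collect the slope b1*."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        X = np.column_stack([np.ones_like(x), x])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        fitted, resid = X @ b, y - X @ b
        slopes = np.empty(B)
        for i in range(B):
            e_star = rng.choice(resid, size=len(y), replace=True)
            b_star, *_ = np.linalg.lstsq(X, fitted + e_star, rcond=None)
            slopes[i] = b_star[1]
        return slopes   # use its percentiles, or the reflection method below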

To obtain a confidence interval we could use the percentiles as before. A better approach is the reflection method: with b1*(α/2) and b1*(1 - α/2) the percentiles of the bootstrap slopes, let d1 = b1 - b1*(α/2) and d2 = b1*(1 - α/2) - b1; the interval is b1 - d2 ≤ β1 ≤ b1 + d1.

Random-X Sampling Version. When the error variances are not constant or the predictor variables cannot be regarded as fixed constants, random-X sampling is used: for each bootstrap sample, we sample n (X, Y) pairs with replacement from the data set. In effect we sample rows of the data set with replacement.

Fixed-X Example: Toluca Data. Assume that the base regression has been run and that the residuals are stored in column c3 and the predicted values in c4.
  let k1 = k1 + 1              # bump the run counter
  sample 25 c3 c5;             # resample 25 residuals into c5
    replace.                   #   with replacement
  let c6 = c4 + c5             # bootstrap Y* = fitted + resampled residual
  regress c6 1 'X';            # refit the regression
    coefficients c7.
  let c8(k1) = c7(2)           # save this run's bootstrap slope b1*

Neural Networks. The i-th observation is modeled as a nonlinear function of m derived predictors, H_0, …, H_{m-1}.

Neural Networks. OK, so what is g_Y, and how are the derived predictors obtained? g_Y is usually a logistic function, and each H_j is a nonlinear function of a linear combination of the predictors X. Here X_i is the i-th row of the X matrix.

Neural Networks. Put these together and you get the neural network model. A common choice for all of the nonlinear functions is again the logistic.
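The model equations were figures in the original; one standard way to write the single-hidden-layer model described here (an assumption, not copied from the slides) is

    H_{i0} = 1, \qquad H_{ij} = g_j(\mathbf{X}_i' \boldsymbol{\alpha}_j), \quad j = 1, \ldots, m-1,
    \qquad
    Y_i = g_Y\!\big(\beta_0 H_{i0} + \beta_1 H_{i1} + \cdots + \beta_{m-1} H_{i,m-1}\big) + \varepsilon_i,

with the logistic choice g(u) = [1 + e^{-u}]^{-1} for g_Y and the g_j.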

Neural Networks The gj functions are sometimes called the “activation” functions: The original idea was that when a linear combination of the predictors got large enough, a brain synapse would “fire” or “activate.” So this was an attempt to model a “step” input function.

Neural Networks Using the logistic for the gY and gj functions leads to the single-hidden-layer, feedforward neural network. Sometimes called the single layer perceptron.

Network Representation Useful to view as network and compare to multiple regression:

Parameter Estimation: Penalized Least Squares. Recall we found that if too many parameters are fit in OLS, our ability to predict hold-out data can deteriorate. So we looked at adjusted R^2, AIC, BIC, and Mallows' Cp, which all have built-in penalties for having too many parameters. Dropping some predictors is like setting the corresponding parameter estimates to zero, which "shrinks" the size of the regression coefficient vector. Another way to do this is to leave all of the predictors in but place a penalty on the estimation criterion for the size of b.

Parameter Estimation: Penalized Least Squares. This leads to the "penalized least squares" method: choose the parameter estimates to minimize the error sum of squares plus an overfit penalty, where the penalty is proportional to the sum of squares of the estimates.
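The criterion was a figure in the original; in generic form (not copied from the slides) it is

    Q(\boldsymbol{\theta}) = \sum_{i=1}^{n} \big(Y_i - \hat{Y}_i(\boldsymbol{\theta})\big)^2
        + \lambda \sum_{j} \theta_j^2,

where θ collects the network weights and the penalty weight λ plays the same role as the ridge biasing constant c.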

Example Using JMP (SAS) Software. We'll consider the Ischemic Heart Disease data set in Appendix C.9. The response is log(total cost of subscriber claims), and four predictors are considered (X1 is variable 5, X2 is variable 6, X3 is variable 9, and X4 is variable 8). The first 400 observations are used to fit (train) the model, and the last 388 are held out for validation.

Example Using JMP (SAS) Software

Example Using JMP (SAS) Software

Example Using JMP (SAS) Software

Comparison with Linear and Quadratic OLS Fits

Comparison of Statistical and NN Terms

A Sampling Application. Frequently we have an idea about the variability in y based on an x-variable. Forestry application: what is the average age of the trees in a stand? The diameter of a tree is "easy"; the age of a tree is obtained via…? Division of labor: diameters (first-year grad students); ages of the trees in the sample (second-year grad students); estimation of the average age in the forest (third year); summary report and invoice for the professor/supervisor.

There are 1132 trees in the forest, and the average diameter is 10.3. The regression fit gives an estimated average age of 118.3 (the raw average age in the sample is 107.4, and the average diameter in the sample is 9.44; i.e., the sample tended to have smaller trees in it, and the fit corrects for this). Note: we got 118.46 using estimated standard deviations. Run the fit on the Lohr data.
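Not on the slide: assuming this is the usual survey-sampling regression estimator, the adjustment being described is

    \bar{y}_{\mathrm{reg}} = \bar{y}_{\mathrm{sample}} + b_1\big(\bar{x}_{\mathrm{forest}} - \bar{x}_{\mathrm{sample}}\big)
                           = 107.4 + b_1\,(10.3 - 9.44),

which matches the reported 118.3 when the fitted slope b_1 is roughly 12.7.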