Model Selection and Validation

Model-Building Process
1. Data collection and preparation
2. Reduction of explanatory or predictor variables (for exploratory observational studies)
3. Model refinement and selection
4. Model validation

Data Collection and Preparation

Controlled experiments with covariates: Statistical design of experiments uses supplemental information, such as characteristics of the experimental units, in designing the experiment so as to reduce the variance of the experimental error terms in the regression model. When it is not possible to incorporate this supplemental information into the design of the experiment, the experimenter may instead incorporate it into the regression model, and thereby reduce the error variance, by including the uncontrolled variables, or covariates, in the model.

Confirmatory observational studies: For these studies, data are collected for explanatory variables that previous studies have shown to affect the response variable, as well as for the new variable or variables involved in the hypothesis. The explanatory variable(s) involved in the hypothesis are sometimes called the primary variables, and the explanatory variables included to reflect existing knowledge are called the control variables (known risk factors in epidemiology). The control variables here are not controlled as in an experimental study, but they are used to account for known influences on the response variable.

Exploratory observational studies: After a lengthy list of potentially useful explanatory variables has been compiled, some of these variables can be quickly screened out. An explanatory variable (1) may not be fundamental to the problem, (2) may be subject to large measurement errors, and/or (3) may effectively duplicate another explanatory variable in the list. Explanatory variables that cannot be measured may either be deleted or replaced by proxy variables that are highly correlated with them.

Data Preparation: Once the data have been collected, edit checks should be performed and plots prepared to identify gross data errors as well as extreme outliers.

Preliminary Model Investigation: A variety of diagnostics should be employed to identify (1) the functional forms in which the explanatory variables should enter the regression model and (2) important interactions that should be included in the model. Scatter plots and residual plots are useful for determining relationships and their strengths.
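As a rough illustration of this preliminary screening (not part of the original slides), the sketch below builds a scatter-plot matrix and a residuals-versus-fitted plot for a simulated data set; the variable names x1, x2, y are placeholders.

```python
# Preliminary model investigation sketch: scatter-plot matrix plus a
# residuals-vs-fitted plot for a first-pass fit. Data are simulated.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2 + 1.5 * df["x1"] - 0.8 * df["x2"] + rng.normal(scale=0.5, size=n)

# Scatter-plot matrix to eyeball functional forms and pairwise relationships.
pd.plotting.scatter_matrix(df, figsize=(6, 6))

# First-pass fit, then residuals vs. fitted values to look for curvature or
# nonconstant spread before any serious model building.
fit = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2"]])).fit()
plt.figure()
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```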

Reduction of Explanatory Variables

Controlled experiments: The reduction of explanatory variables in the model-building phase is usually not an important issue, since the experimenter has chosen the explanatory variables for investigation.

Controlled experiments with covariates: Some reduction of the covariates may take place, because investigators often cannot be sure in advance that the selected covariates will be helpful in reducing the error variance.

Confirmatory observational studies: No reduction of explanatory variables should take place. The control variables were chosen on the basis of prior knowledge and should be retained for comparison with earlier studies, even if some of them turn out not to lead to any error variance reduction in the study at hand.

Exploratory observational studies: The number of explanatory variables that remain after the initial screening is typically still large, and many of these variables frequently will be highly intercorrelated.

Model Refinement and Selection: Check the tentative regression model, or the several "good" regression models, in detail for curvature and interaction effects. Residual plots are helpful in deciding whether one model is to be preferred over another, and in identifying influential outlying observations, multicollinearity, etc.

Model Selection

Criteria for Model Selection

Automatic Search Procedures for Model Selection

"Best" Subsets Algorithms: the best subsets according to a specified criterion are identified without requiring the fitting of all of the possible subset regression models.
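As a minimal sketch of the idea (not the slides' algorithm), the code below simply fits every subset of a small candidate pool and scores each by AIC and adjusted R-squared; the choice of criterion is mine. Real best-subsets algorithms, such as branch-and-bound, find the leading subsets without fitting every model.

```python
# Brute-force "all subsets" scoring for a small candidate pool.
# Illustrative only: true best-subsets algorithms avoid fitting every model.
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
y = 1 + 2 * X["x1"] - 1.5 * X["x3"] + rng.normal(scale=1.0, size=n)

results = []
for k in range(1, X.shape[1] + 1):
    for subset in itertools.combinations(X.columns, k):
        fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
        results.append((subset, fit.aic, fit.rsquared_adj))

# Report the few subsets with the smallest AIC.
for subset, aic, r2a in sorted(results, key=lambda t: t[1])[:3]:
    print(subset, round(aic, 1), round(r2a, 3))
```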

Stepwise Regression Methods: forward stepwise regression and backward stepwise regression.
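A bare-bones sketch of forward stepwise selection follows, again scored by AIC (an assumed criterion; classical forward stepwise regression uses partial F or t tests with entry and removal limits).

```python
# Forward stepwise selection sketch: at each step add the predictor that most
# improves AIC; stop when no addition helps.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x1", "x2", "x3", "x4"])
y = 1 + 2 * X["x1"] - 1.5 * X["x3"] + rng.normal(size=n)

selected, remaining = [], list(X.columns)
current_aic = sm.OLS(y, np.ones(n)).fit().aic  # intercept-only model

while remaining:
    # AIC of the current model plus each remaining candidate, one at a time.
    scores = {v: sm.OLS(y, sm.add_constant(X[selected + [v]])).fit().aic
              for v in remaining}
    best = min(scores, key=scores.get)
    if scores[best] >= current_aic:
        break  # no candidate improves the criterion
    selected.append(best)
    remaining.remove(best)
    current_aic = scores[best]

print("Selected predictors:", selected)
```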

Model Refinement

F Test for Lack of Fit

Assumptions: The lack-of-fit test assumes that the observations Y for a given X are (1) independent and (2) normally distributed, and that (3) the distributions of Y have the same variance.

Test statistic:
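The statistic itself did not survive the transcript; for simple linear regression with replicate observations at c distinct X levels (the standard KNNL setting), it is

F* = [SSLF / (c - 2)] / [SSPE / (n - c)] = MSLF / MSPE,

where SSLF is the lack-of-fit sum of squares, SSPE the pure-error sum of squares, and n the total number of observations. Large values of F* relative to F(1 - alpha; c - 2, n - c) lead to the conclusion that the regression function is not linear.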

If the linear regression model is not appropriate for a data set, two options are:
1. Abandon the regression model and develop and use a more appropriate model.
2. Employ some transformation on the data so that the regression model is appropriate for the transformed data.

Diagnostics

Departures from Model to Be Studied by Residuals
1. The regression function is not linear.
2. The error terms do not have constant variance.
3. The error terms are not independent.
4. The model fits all but one or a few outlier observations.
5. The error terms are not normally distributed.
6. One or several important predictor variables have been omitted from the model.

Diagnostics for Residuals
1. Plot of residuals against the predictor variable.
2. Plot of absolute or squared residuals against the predictor variable.
3. Plot of residuals against fitted values.
4. Plot of residuals against time or other sequence.
5. Plots of residuals against omitted predictor variables.
6. Box plot of residuals.
7. Normal probability plot of residuals.
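As an illustration (not from the slides), the sketch below produces several of these plots for a simulated simple regression; statsmodels' qqplot supplies the normal probability plot.

```python
# A few of the residual diagnostics listed above, for a simulated simple
# regression: residuals vs. predictor, vs. fitted values, vs. observation
# order (sequence plot), and a normal probability plot.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 80
x = rng.uniform(0, 10, size=n)
y = 3 + 0.7 * x + rng.normal(scale=1.0, size=n)

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid = fit.resid

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].scatter(x, resid)
axes[0, 0].set_title("Residuals vs. predictor")
axes[0, 1].scatter(fit.fittedvalues, resid)
axes[0, 1].set_title("Residuals vs. fitted values")
axes[1, 0].plot(resid, marker="o")
axes[1, 0].set_title("Sequence plot")
sm.qqplot(resid, line="45", fit=True, ax=axes[1, 1])
axes[1, 1].set_title("Normal probability plot")
fig.tight_layout()
plt.show()
```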

Nonlinearity of Regression Function: residual plot against the predictor variable, residual plot against the fitted values, scatter plot.

Nonconstancy of Error Variance: residual plot against the predictor variable, residual plot against the fitted values, Brown-Forsythe test, Breusch-Pagan test.
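A sketch of the Breusch-Pagan test using statsmodels, on simulated data whose error spread grows with x so the test has something to find; the Brown-Forsythe test could be run by splitting the residuals into low-x and high-x groups and calling scipy.stats.levene with center="median".

```python
# Breusch-Pagan test for nonconstant error variance on simulated
# heteroscedastic data.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
n = 150
x = rng.uniform(1, 10, size=n)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x)  # error spread grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan LM = {lm_stat:.2f}, p-value = {lm_pvalue:.4f}")
```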

Presence of Outliers: residual plots against X or Y, box plots, stem-and-leaf plots, dot plots of the residuals, plots of semistudentized residuals.

Nonindependence of Error Terms: sequence plot of the residuals.

Nonnormality of Error Terms: box plot, histogram, dot plot, stem-and-leaf plot, and normal probability plot of the residuals; goodness-of-fit tests such as the chi-square test or the Kolmogorov-Smirnov test and its modification, the Lilliefors test.
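A sketch applying these tests to regression residuals: the Lilliefors modification is available as statsmodels.stats.diagnostic.lilliefors, and scipy supplies a plain Kolmogorov-Smirnov test for comparison.

```python
# Normality checks on residuals: Lilliefors (KS with estimated mean and
# variance) and, for comparison, a plain KS test against a fitted normal.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import lilliefors
from scipy import stats

rng = np.random.default_rng(5)
n = 100
x = rng.uniform(0, 10, size=n)
y = 1 + 2 * x + rng.normal(scale=1.0, size=n)

resid = sm.OLS(y, sm.add_constant(x)).fit().resid

ks_stat, ks_p = lilliefors(resid, dist="norm")
print(f"Lilliefors: D = {ks_stat:.3f}, p = {ks_p:.3f}")

# Plain KS test (strictly valid only when mean and sd are not estimated
# from the same data).
d, p = stats.kstest(resid, "norm", args=(resid.mean(), resid.std(ddof=1)))
print(f"KS vs. fitted normal: D = {d:.3f}, p = {p:.3f}")
```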

Omission of Important Predictor Variables: residuals should be plotted against variables omitted from the model that might have important effects on the response.

Model Adequacy for a Predictor Variable: Added-Variable Plots
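Added-variable (partial regression) plots can be drawn directly with statsmodels; a minimal sketch with simulated data:

```python
# Added-variable plots: for each predictor, the residuals of y given the
# other predictors are plotted against the residuals of that predictor given
# the others, showing its marginal contribution to the fit.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 120
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
y = 1 + 2 * X["x1"] + 0.5 * X["x2"] + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
sm.graphics.plot_partregress_grid(fit)  # one added-variable plot per predictor
plt.show()
```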

Remedial Measures

Nonlinearity of Regression Function: apply a transformation, or modify the regression model by altering the nature of the regression function. For instance:
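The example that followed is missing from the transcript; a typical illustration (my choice, not necessarily the slide's) is replacing a straight-line regression function with a quadratic one, or transforming a predictor.

```python
# One common way to alter the regression function when a residual plot shows
# curvature: add a squared term (alternatively, regress on a transformed x
# such as log(x)).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
x = rng.uniform(0, 5, size=n)
y = 1 + 2 * x - 0.6 * x**2 + rng.normal(scale=0.5, size=n)  # truly curved

linear = sm.OLS(y, sm.add_constant(x)).fit()
quad = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()

print("Linear fit R^2:   ", round(linear.rsquared, 3))
print("Quadratic fit R^2:", round(quad.rsquared, 3))
```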

Nonconstancy of Error Variance (and Nonnormality of Error Terms): weighted least squares.
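A minimal weighted-least-squares sketch, assuming the error standard deviation is proportional to x so that weights proportional to 1/x^2 are appropriate:

```python
# Weighted least squares when the error standard deviation is (assumed)
# proportional to x: use weights 1 / x^2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 150
x = rng.uniform(1, 10, size=n)
y = 2 + 0.5 * x + rng.normal(scale=0.3 * x)  # heteroscedastic errors

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()

print("OLS coefficients:", np.round(ols_fit.params, 3))
print("WLS coefficients:", np.round(wls_fit.params, 3))
```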

Nonindependence of Error Terms: when the error terms are correlated, a direct remedial measure is to work with a model that calls for correlated error terms.
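One concrete form of such a model is a regression with first-order autoregressive errors; the sketch below (my choice of example) flags the autocorrelation with the Durbin-Watson statistic and then refits with statsmodels' GLSAR, which iterates between the regression fit and the AR(1) parameter.

```python
# Regression with AR(1) errors: detect autocorrelation with Durbin-Watson,
# then refit with GLSAR, a model that allows first-order correlated errors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(9)
n = 200
x = np.linspace(0, 10, n)
e = np.zeros(n)
for t in range(1, n):                        # AR(1) disturbances, rho = 0.7
    e[t] = 0.7 * e[t - 1] + rng.normal(scale=0.5)
y = 1 + 0.8 * x + e

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
print("Durbin-Watson:", round(durbin_watson(ols_fit.resid), 2))  # ~2 if independent

glsar_fit = sm.GLSAR(y, X, rho=1).iterative_fit(maxiter=10)
print("Estimated rho:", np.round(glsar_fit.model.rho, 2))
print("GLSAR coefficients:", np.round(glsar_fit.params, 3))
```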

Multicollinearity Remedial Measures

Ridge Regression: we prefer an estimator that has only a small bias and is substantially more precise than an unbiased estimator.
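A minimal ridge sketch using scikit-learn; the penalty accepts a small bias in exchange for much smaller coefficient variance when predictors are nearly collinear. The value alpha = 1.0 is arbitrary here; in practice it would be chosen by, for example, a ridge trace or cross-validation.

```python
# Ridge regression on deliberately collinear predictors, compared with OLS.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(10)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)     # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)           # penalty chosen arbitrarily here

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```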

Remedial Measures for Influential Cases

Robust Regression: IRLS robust regression; LMS (least median of squares) regression.
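A sketch of IRLS robust regression via statsmodels' RLM with a Huber weight function (LMS regression is not available in statsmodels, so it is only mentioned here):

```python
# IRLS robust regression (Huber weights) on data containing a few gross
# outliers; compare the robust slope with the ordinary least squares slope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 100
x = rng.uniform(0, 10, size=n)
y = 2 + 1.5 * x + rng.normal(scale=1.0, size=n)
y[:5] += 30                                  # plant a few gross outliers

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
rlm_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()  # fitted by IRLS

print("OLS slope:   ", round(ols_fit.params[1], 3))
print("Robust slope:", round(rlm_fit.params[1], 3))
```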

Model Validation

1. Collection of new data to check the model and its predictive ability.
2. Comparison of results with theoretical expectations, earlier empirical results, and simulation results.
3. Use of a holdout sample to check the model and its predictive ability.

Collection of New Data to Check Model

Methods of checking validity: reestimate the model form chosen earlier using the new data; calibrate the predictive capability of the selected regression model using the mean squared prediction error:
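The formula itself did not survive the transcript; in the usual notation, with n* cases in the new (validation) data,

MSPR = sum over i = 1, ..., n* of (Y_i - Yhat_i)^2 / n*,

where Y_i is the observed response for the i-th validation case and Yhat_i is the prediction for that case from the model fitted to the model-building data. An MSPR reasonably close to the MSE from the model-building data suggests that the MSE gives a fair indication of the model's predictive ability.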

Comparison with Theory, Empirical Evidence, or Simulation Results: comparison of regression coefficients and predictions with theoretical expectations, previous empirical results, or simulation results.

Data Splitting: when the data set is large enough, an alternative to collecting new data is to split the data into two sets. The first set, called the model-building set or the training sample, is used to develop the model. The second data set, called the validation or prediction set, is used to evaluate the reasonableness and predictive ability of the selected model. This validation procedure is often called cross-validation.
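A minimal data-splitting sketch with scikit-learn and statsmodels; the 50/50 split and the use of MSPR as the validation summary are my choices, not the slides'.

```python
# Data splitting sketch: fit on the model-building (training) half, then
# compute the mean squared prediction error (MSPR) on the validation half.
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
n = 300
X = rng.normal(size=(n, 3))
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.5, random_state=0)

fit = sm.OLS(y_train, sm.add_constant(X_train)).fit()
y_pred = fit.predict(sm.add_constant(X_valid))

mspr = np.mean((y_valid - y_pred) ** 2)
print("MSE (model-building set):", round(fit.mse_resid, 3))
print("MSPR (validation set):   ", round(mspr, 3))
```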