1 Model Selection and Validation

2 Model-Building Process
1. Data collection and preparation
2. Reduction of explanatory or predictor variables (for exploratory observational studies)
3. Model refinement and selection
4. Model validation

3 Data Collection and Preparation
Controlled Experiments with Covariates: Statistical design of experiments uses supplemental information, such as characteristics of the experimental units, in designing the experiment so as to reduce the variance of the experimental error terms in the regression model. Sometimes, however, it is not possible to incorporate this supplemental information into the design of the experiment. Instead, it may be possible for the experimenter to incorporate this information into the regression model and thereby reduce the error variance by including uncontrolled variables or covariates in the model.
Confirmatory Observational Studies: For these studies, data are collected for explanatory variables that previous studies have shown to affect the response variable, as well as for the new variable or variables involved in the hypothesis. In this context, the explanatory variable(s) involved in the hypothesis are sometimes called the primary variables, and the explanatory variables that are included to reflect existing knowledge are called the control variables (known risk factors in epidemiology). The control variables here are not controlled as in an experimental study, but they are used to account for known influences on the response variable.

4 Exploratory Observational Studies: After a lengthy list of potentially useful explanatory variables has been compiled, some of these variables can be quickly screened out. An explanatory variable (1) may not be fundamental to the problem, (2) may be subject to large measurement errors, and/or (3) may effectively duplicate another explanatory variable in the list. Explanatory variables that cannot be measured may either be deleted or replaced by proxy variables that are highly correlated with them.

5 Data Preparation
Once the data have been collected, edit checks should be performed and plots prepared to identify gross data errors as well as extreme outliers.
Preliminary Model Investigation: A variety of diagnostics should be employed to identify (1) the functional forms in which the explanatory variables should enter the regression model and (2) important interactions that should be included in the model. Scatter plots and residual plots are useful for determining relationships and their strengths.

6 Reduction of Explanatory Variables
Controlled Experiments: The reduction of explanatory variables in the model-building phase is usually not an important issue for controlled experiments. The experimenter has chosen the explanatory variables for investigation.
Controlled Experiments with Covariates: In studies of controlled experiments with covariates, some reduction of the covariates may take place because investigators often cannot be sure in advance that the selected covariates will be helpful in reducing the error variance.
Confirmatory Observational Studies: No reduction of explanatory variables should take place in confirmatory observational studies. The control variables were chosen on the basis of prior knowledge and should be retained for comparison with earlier studies even if some of the control variables turn out not to lead to any error variance reduction in the study at hand.
Exploratory Observational Studies: In exploratory observational studies, the number of explanatory variables that remain after the initial screening typically is still large. Further, many of these variables frequently will be highly intercorrelated.

7 Model Refinement and Selection
Check the tentative regression model, or the several "good" regression models, in detail for curvature and interaction effects. Residual plots are helpful in deciding whether one model is to be preferred over another. Also identify influential outlying observations, multicollinearity, and similar problems.

8 Model Selection

9 Criteria for Model Selection

10 Automatic Search Procedures for Model Selection
"Best" Subsets Algorithms: The best subsets according to a specified criterion are identified without requiring the fitting of all of the possible subset regression models.
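To make the idea of ranking subsets by a criterion concrete, here is a minimal sketch. It is not the efficient algorithm the slide refers to: it fits every subset by brute force, which is only practical for a handful of predictors. The criterion (adjusted R-squared) and all function names (`solve`, `sse_for`, `best_subsets`) are assumptions chosen for illustration.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def sse_for(rows, y, cols):
    """Error sum of squares for a least-squares fit with an intercept plus the chosen columns."""
    design = [[1.0] + [row[j] for j in cols] for row in rows]
    p = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def best_subsets(rows, y, top=3):
    """Exhaustively score every non-empty predictor subset by adjusted R-squared."""
    n, k = len(y), len(rows[0])
    ybar = sum(y) / n
    ssto = sum((yi - ybar) ** 2 for yi in y)
    scored = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            p = size + 1  # number of regression parameters, intercept included
            adj_r2 = 1 - (sse_for(rows, y, cols) / (n - p)) / (ssto / (n - 1))
            scored.append((adj_r2, cols))
    scored.sort(reverse=True)
    return scored[:top]
```

Each returned entry is a pair of the adjusted R-squared and a tuple of column indices. Swapping in another criterion only changes the line that computes `adj_r2`.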

11 Automatic Search Procedures for Model Selection
Stepwise Regression Methods:
Forward Stepwise Regression
Backward Stepwise Regression

12 Model Refinement

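13 F Test for Lack of Fit
Assumptions: The lack of fit test assumes that the observations Y for given X are (1) independent and (2) normally distributed, and that (3) the distributions of Y have the same variance.
Test statistic: F* = MSLF / MSPE = [SSLF / (c - p)] / [SSPE / (n - c)], where n is the number of cases, c is the number of distinct X levels, p is the number of regression parameters, SSPE is the pure error sum of squares, and SSLF = SSE - SSPE is the lack of fit sum of squares.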
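As an illustration of the lack of fit test, the following sketch computes the F statistic for a straight-line fit with one predictor. It assumes replicate observations at some X levels and more than two distinct levels. The function names are hypothetical, and the p-value lookup against the F distribution with (c - 2, n - c) degrees of freedom is left out.

```python
from collections import defaultdict


def fit_line(x, y):
    """Ordinary least squares for Y = b0 + b1*X; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def lack_of_fit_f(x, y):
    """Return (F*, df_lack_of_fit, df_pure_error) for a straight-line model."""
    n = len(x)
    b0, b1 = fit_line(x, y)

    # Group the responses by X level so replicates can be compared.
    by_level = defaultdict(list)
    for xi, yi in zip(x, y):
        by_level[xi].append(yi)
    c = len(by_level)  # number of distinct X levels

    # Pure error: variation of Y around its own level mean.
    sspe = 0.0
    for ys in by_level.values():
        level_mean = sum(ys) / len(ys)
        sspe += sum((yi - level_mean) ** 2 for yi in ys)

    # Lack of fit: what is left of SSE after removing pure error.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sslf = sse - sspe

    df_lf, df_pe = c - 2, n - c  # two parameters in the straight-line model
    return (sslf / df_lf) / (sspe / df_pe), df_lf, df_pe
```

A large F* relative to the F(c - 2, n - c) distribution suggests that the straight-line regression function is not appropriate.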

14 If the linear regression model is not appropriate for a data set:
1. Abandon the regression model and develop and use a more appropriate model.
2. Employ some transformation on the data so that the regression model is appropriate for the transformed data.

15 Diagnostics

16 Departures from Model to Be Studied by Residuals
1. The regression function is not linear.
2. The error terms do not have constant variance.
3. The error terms are not independent.
4. The model fits all but one or a few outlier observations.
5. The error terms are not normally distributed.
6. One or several important predictor variables have been omitted from the model.

17 Diagnostics for Residuals
1. Plot of residuals against predictor variable.
2. Plot of absolute or squared residuals against predictor variable.
3. Plot of residuals against fitted values.
4. Plot of residuals against time or other sequence.
5. Plots of residuals against omitted predictor variables.
6. Box plot of residuals.
7. Normal probability plot of residuals.

18 Nonlinearity of Regression Function
Residual plot against the predictor variable
Residual plot against the fitted values
Scatter plot

19 Nonconstancy of Error Variance
Residual plot against the predictor variable
Residual plot against the fitted values
Brown-Forsythe test and Breusch-Pagan test
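The Brown-Forsythe test can be sketched for a one-predictor regression as follows. The version shown is the two-group form: cases are split into a low-X and a high-X group, and the absolute deviations of the residuals from their group medians are compared with a two-sample t statistic. Splitting at the median of X and the function names are assumptions made for this illustration.

```python
import math
import statistics


def fit_line(x, y):
    """Ordinary least squares for Y = b0 + b1*X; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def brown_forsythe_t(x, y, split=None):
    """Two-group Brown-Forsythe t statistic for constancy of error variance."""
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    if split is None:
        split = statistics.median(x)  # assumed split point: median of X

    low = [e for xi, e in zip(x, resid) if xi <= split]
    high = [e for xi, e in zip(x, resid) if xi > split]

    # Absolute deviations of the residuals around each group's median.
    d_low = [abs(e - statistics.median(low)) for e in low]
    d_high = [abs(e - statistics.median(high)) for e in high]
    m_low, m_high = sum(d_low) / len(d_low), sum(d_high) / len(d_high)

    # Pooled variance of the absolute deviations, with n - 2 degrees of freedom.
    pooled = (
        sum((d - m_low) ** 2 for d in d_low)
        + sum((d - m_high) ** 2 for d in d_high)
    ) / (len(x) - 2)
    return (m_low - m_high) / math.sqrt(pooled * (1 / len(d_low) + 1 / len(d_high)))
```

The statistic is compared with the t distribution with n - 2 degrees of freedom. A large absolute value indicates that the error variance differs between the two groups.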

20 Presence of Outliers
Residual plots against X or Ŷ
Box plots, stem-and-leaf plots, and dot plots of the residuals
Plotting of semistudentized residuals
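Semistudentized residuals divide each residual by the square root of MSE, which puts them on a roughly standard scale. The sketch below computes them for a one-predictor regression. The function names are hypothetical, and the cutoff of four mentioned afterwards is a rough rule of thumb rather than a formal test.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for Y = b0 + b1*X; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def semistudentized_residuals(x, y):
    """Return e_i / sqrt(MSE) for each case, with MSE = SSE / (n - 2)."""
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (len(x) - 2)
    return [e / math.sqrt(mse) for e in resid]
```

With a reasonably large number of cases, values of four or more in absolute size are often treated as candidate outliers worth a closer look.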

21 Nonindependence of Error Terms
Sequence plot of the residuals

22 Nonnormality of Error Terms
Box plot, histogram, dot plot, or stem-and-leaf plot of the residuals
Normal probability plot of the residuals
Goodness of fit tests: the chi-square test, or the Kolmogorov-Smirnov test and its modification, the Lilliefors test
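A normal probability plot pairs each ordered residual with its expected value under normality. The sketch below computes those expected values using the common approximation sqrt(MSE) * z[(k - 0.375) / (n + 0.25)] for the k-th smallest residual, and also returns the correlation between the two columns as a numeric summary. The function names and the default of two regression parameters are assumptions.

```python
import math
from statistics import NormalDist


def normal_scores(residuals, n_params=2):
    """Expected values of the ordered residuals under normality."""
    n = len(residuals)
    mse = sum(e * e for e in residuals) / (n - n_params)
    std_normal = NormalDist()
    return [
        math.sqrt(mse) * std_normal.inv_cdf((k - 0.375) / (n + 0.25))
        for k in range(1, n + 1)
    ]


def normality_correlation(residuals, n_params=2):
    """Correlation between the ordered residuals and their expected normal values."""
    ordered = sorted(residuals)
    expected = normal_scores(residuals, n_params)
    n = len(ordered)
    mean_o, mean_e = sum(ordered) / n, sum(expected) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(ordered, expected))
    var_o = sum((o - mean_o) ** 2 for o in ordered)
    var_e = sum((e - mean_e) ** 2 for e in expected)
    return cov / math.sqrt(var_o * var_e)
```

Plotting `sorted(residuals)` against `normal_scores(residuals)` gives the normal probability plot. A correlation close to 1 is consistent with normally distributed error terms, while a clearly lower value points to a departure from normality.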

23 Omission of Important Predictor Variables
Residuals should be plotted against variables omitted from the model that might have important effects on the response.

24 Model Adequacy for a Predictor Variable: Added-Variable Plots

25 Remedial measures

26 Nonlinearity of Regression Function
Transform the data.
Modify the regression model by altering the nature of the regression function, for instance by adding a quadratic term in the predictor.

27 Nonconstancy of Error Variance and Nonnormality of Error Terms

28 Weighted Least Squares
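A minimal sketch of weighted least squares for a single predictor follows. It assumes the weights are already known, typically w_i = 1 / variance_i, so that cases with a larger error variance count for less. Estimating the weights from the data is a separate step that is not shown, and the function name is hypothetical.

```python
def weighted_fit(x, y, w):
    """Weighted least squares for Y = b0 + b1*X with case weights w; returns (b0, b1)."""
    sw = sum(w)
    # Weighted means of X and Y.
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    # Weighted sums of squares and cross products around those means.
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1
```

With all weights equal, the result matches ordinary least squares. A case with a weight near zero has almost no influence on the fitted line.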

29 Nonindependence of Error Terms
When the error terms are correlated, a direct remedial measure is to work with a model that calls for correlated error terms.

30 Multicollinearity Remedial Measures
Ridge Regression: We may prefer an estimator that has only a small bias and is substantially more precise than an unbiased estimator.
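To show how ridge regression trades bias for precision, here is a sketch for the two-predictor case in standardized (correlation) form. The ridge coefficients solve (r_XX + c*I) b = r_YX, where r_XX is the correlation matrix of the predictors, r_YX holds the correlations of each predictor with Y, and c >= 0 is the biasing constant. The function names are hypothetical, and restricting to two predictors is a simplification that lets the 2x2 system be solved in closed form.

```python
import math


def corr(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def ridge_two_predictors(x1, x2, y, c):
    """Standardized ridge coefficients (b1, b2) for biasing constant c >= 0."""
    r12, ry1, ry2 = corr(x1, x2), corr(x1, y), corr(x2, y)
    # Invert [[1 + c, r12], [r12, 1 + c]] in closed form.
    det = (1 + c) ** 2 - r12 ** 2
    b1 = ((1 + c) * ry1 - r12 * ry2) / det
    b2 = ((1 + c) * ry2 - r12 * ry1) / det
    return b1, b2
```

Setting c = 0 gives the ordinary standardized least squares coefficients. When the two predictors are highly correlated, `det` is close to zero at c = 0 and the coefficients are unstable. A small positive c moves `det` away from zero, which stabilizes the estimates at the cost of some bias.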

31 Remedial Measures for Influential Cases
Robust Regression:
IRLS (iteratively reweighted least squares) robust regression
LMS (least median of squares) regression

32 Model Validation

33
1. Collection of new data to check the model and its predictive ability.
2. Comparison of results with theoretical expectations, earlier empirical results, and simulation results.
3. Use of a holdout sample to check the model and its predictive ability.

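34 Collection of New Data to Check Model
Methods of checking validity:
Reestimate the model form chosen earlier using the new data.
Calibrate the predictive capability of the selected regression model.
Mean squared prediction error: MSPR = Σ (Y_i - Ŷ_i)² / n*, where Y_i is the observed response for the i-th new case, Ŷ_i is its value predicted by the model fitted to the original data, and n* is the number of new cases.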
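The mean squared prediction error is simple to compute once predictions for the new cases are available. The sketch below assumes the predictions come from the model fitted to the original data. The function name is hypothetical.

```python
def mspr(observed, predicted):
    """Mean squared prediction error over the new (validation) cases."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
```

If MSPR is fairly close to the MSE of the fit on the original data, MSE is not seriously biased as an indicator of predictive ability. An MSPR much larger than MSE suggests that the model predicts new cases less well than the original fit implies.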

35 Comparison with Theory, Empirical Evidence, or Simulation Results
Comparisons of regression coefficients and predictions with theoretical expectations, previous empirical results, or simulation results.

36 Data Splitting
When the data set is large enough, one option is to split the data into two sets. The first set, called the model-building set or the training sample, is used to develop the model. The second data set, called the validation or prediction set, is used to evaluate the reasonableness and predictive ability of the selected model. This validation procedure is often called cross-validation.
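The data-splitting procedure can be sketched for a one-predictor regression as follows. The random split, the 50/50 default, the fixed seed, and the function names are all assumptions made for illustration. In practice the split fraction depends on how many cases are needed to develop a reliable model.

```python
import random


def split_indices(n, train_fraction=0.5, seed=0):
    """Randomly partition case indices into a training set and a validation set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * train_fraction))
    return sorted(idx[:cut]), sorted(idx[cut:])


def fit_line(x, y):
    """Ordinary least squares for Y = b0 + b1*X; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def holdout_check(x, y, train_fraction=0.5, seed=0):
    """Fit on the training cases; return (training MSE, validation MSPR)."""
    train, valid = split_indices(len(x), train_fraction, seed)
    b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
    # MSE uses n - 2 degrees of freedom for the two fitted parameters.
    mse = sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in train) / (len(train) - 2)
    # MSPR averages squared prediction errors over the held-out cases.
    mspr = sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in valid) / len(valid)
    return mse, mspr
```

Comparing the two returned numbers gives a quick indication of whether the fit on the training sample carries over to cases the model has not seen.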

