4.3 Diagnostic Checks. VO 107.425 - Verallgemeinerte lineare Regressionsmodelle (Generalized Linear Regression Models).

Diagnostic Checks

Goodness-of-fit tests provide only global measures of the fit of a model. Regression diagnostics aims at identifying the reasons for a bad fit. In particular, diagnostic measures should identify observations
- that are not well explained by the model, and
- that are influential for some aspect of the fit.

4.3.1 Residuals

Residuals measure the agreement between single observations and their fitted values and help to identify poorly fitting observations that may have a strong impact on the overall fit of the model. For scaled binomial data the Pearson residual has the form

r_P(y_i, π_i) = (y_i - π_i) / √(π_i(1 - π_i)/n_i),

where √(π_i(1 - π_i)/n_i) is the estimated standard deviation of y_i, π_i is the probability under the fitted model, and n_i is the number of observations in group i.
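As a concrete illustration of this formula, the following minimal sketch computes Pearson residuals for a grouped binomial logit fit; the simulated data, variable names, and the use of statsmodels are illustrative assumptions, not part of the lecture's examples.

```python
# Minimal sketch: Pearson residuals for scaled binomial data (illustrative data only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 20)                        # one covariate, 20 groups
n_i = rng.integers(5, 30, size=x.size)            # group sizes n_i
pi_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x)))
successes = rng.binomial(n_i, pi_true)
y_bar = successes / n_i                           # scaled binomial response y_i (relative frequencies)

# Binomial GLM with logit link, fitted on the grouped counts (successes, failures).
X = sm.add_constant(x)
fit = sm.GLM(np.column_stack([successes, n_i - successes]), X,
             family=sm.families.Binomial()).fit()
pi_hat = fit.fittedvalues                         # fitted probabilities pi_i

# Pearson residual: (y_i - pi_i) / sqrt(pi_i * (1 - pi_i) / n_i)
r_pearson = (y_bar - pi_hat) / np.sqrt(pi_hat * (1.0 - pi_hat) / n_i)
print(np.round(r_pearson, 2))
```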

4.3.1 Residuals

For small n_i the distribution of r_P(y_i, π_i) is rather skewed, an effect that is ameliorated by the transformation to Anscombe residuals

r_A(y_i, π_i) = (t(y_i) - t(π_i)) / (t'(π_i) √(π_i(1 - π_i)/n_i)), where t(u) = ∫_0^u s^(-1/3)(1 - s)^(-1/3) ds.

Anscombe residuals use an approximation to var(t(y_i)) obtained by the delta method, which yields var(t(y_i)) ≈ t'(π_i)² π_i(1 - π_i)/n_i.

The Pearson residuals cannot be expected to have unit variance because the variance of the residual y_i - π_i has not been taken into account; the standardization uses only the estimated standard deviation of y_i.
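A sketch of the Anscombe residuals is given below. It assumes the binomial Anscombe transformation t(u) = ∫_0^u s^(-1/3)(1 - s)^(-1/3) ds, evaluated here through the incomplete beta function; the data are again simulated purely for illustration.

```python
# Minimal sketch: Anscombe residuals for scaled binomial data (illustrative data only).
import numpy as np
from scipy import special
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 20)
n_i = rng.integers(5, 30, size=x.size)
successes = rng.binomial(n_i, 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x))))
y_bar = successes / n_i

X = sm.add_constant(x)
fit = sm.GLM(np.column_stack([successes, n_i - successes]), X,
             family=sm.families.Binomial()).fit()
pi_hat = fit.fittedvalues

def t(u):
    """t(u) = integral_0^u s^(-1/3) (1 - s)^(-1/3) ds, i.e. an (unregularized)
    incomplete beta function with parameters (2/3, 2/3)."""
    return special.betainc(2/3, 2/3, u) * special.beta(2/3, 2/3)

def t_prime(u):
    """Derivative of the Anscombe transformation."""
    return (u * (1.0 - u)) ** (-1.0 / 3.0)

# Delta-method standard deviation of t(y_i): t'(pi_i) * sqrt(pi_i * (1 - pi_i) / n_i)
r_anscombe = (t(y_bar) - t(pi_hat)) / (t_prime(pi_hat) *
                                       np.sqrt(pi_hat * (1.0 - pi_hat) / n_i))
print(np.round(r_anscombe, 2))
```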

4.3.1 Residuals

Useful residual plots include:
- residuals against the ordered fitted values: helpful if one suspects that particular variables should be transformed before being included in the linear predictor;
- residuals against the corresponding quantiles of a normal distribution (normal Q-Q plot): compares the standardized residuals to the order statistics of an N(0,1) sample. If the model is correct and the residuals can be expected to be approximately normally distributed, the plot should show approximately a straight line, as long as outliers are absent.
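Both plots can be produced as follows; the sketch uses simulated data and plain Pearson residuals for illustration (standardized residuals would additionally use the leverages introduced in the next section), and assumes matplotlib and scipy are available.

```python
# Minimal sketch: residual plots for a binomial GLM (illustrative data only).
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
x = np.linspace(-2, 2, 40)
n_i = rng.integers(10, 40, size=x.size)
successes = rng.binomial(n_i, 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x))))
y_bar = successes / n_i

X = sm.add_constant(x)
fit = sm.GLM(np.column_stack([successes, n_i - successes]), X,
             family=sm.families.Binomial()).fit()
pi_hat = fit.fittedvalues
r_p = (y_bar - pi_hat) / np.sqrt(pi_hat * (1.0 - pi_hat) / n_i)  # Pearson residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# Residuals against (ordered) fitted values: look for trends or funnel shapes.
order = np.argsort(pi_hat)
ax1.scatter(pi_hat[order], r_p[order])
ax1.axhline(0.0, linestyle="--")
ax1.set_xlabel("fitted values")
ax1.set_ylabel("Pearson residual")

# Normal Q-Q plot: residuals against quantiles of an N(0, 1) sample.
stats.probplot(r_p, dist="norm", plot=ax2)

plt.tight_layout()
plt.show()
```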

Example 4.3: Unemployment

In a study on the duration of unemployment with sample size n = 982, we distinguish between short-term unemployment (≤ 6 months) and long-term unemployment (> 6 months). It is shown that for older unemployed persons the fitted values tend to be larger than the observed ones.

Example 4.4: Food-Stamp Data

The food-stamp data from Künsch et al. (1989) consist of n = 150 persons, 24 of whom participated in the federal food-stamp program. The response indicates participation. The predictors are
- the binary variable tenancy (TEN),
- the binary variable supplemental income (SUP),
- the log-transformed monthly income, LMI = log(monthly income + 1).

4.3.2 Hat Matrix and Influential Observations
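The leverages used to flag influential observations can be computed from the hat matrix of the fitted model. The sketch below assumes the usual GLM form H = W^(1/2) X (X'WX)^(-1) X' W^(1/2), with W the diagonal matrix of working weights (w_i = π_i(1 - π_i) for the logit link); the data, variable names, and the use of statsmodels are illustrative assumptions.

```python
# Minimal sketch: leverages from the generalized hat matrix for a logit model
# (illustrative data; assumes H = W^{1/2} X (X'WX)^{-1} X' W^{1/2}).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
eta = -0.5 + 0.8 * x1 - 1.1 * x2
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
pi_hat = fit.fittedvalues

# Working weights for the logit link (times n_i for grouped data).
w = pi_hat * (1.0 - pi_hat)
Xw = np.sqrt(w)[:, None] * X                       # W^{1/2} X
H = Xw @ np.linalg.solve(Xw.T @ Xw, Xw.T)          # hat matrix
h = np.diag(H)                                     # leverages h_ii

# The leverages sum to the number of fitted parameters; unusually large h_ii
# point to observations with high leverage.
print("sum of leverages:", round(h.sum(), 2), "(= number of parameters)")
print("largest leverages at observations:", np.argsort(h)[-3:])

# One common use: standardized Pearson residuals r_i / sqrt(1 - h_ii).
r_p = (y - pi_hat) / np.sqrt(w)
r_std = r_p / np.sqrt(1.0 - h)
```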

4.3.3 Case Deletion
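The case-deletion examples that follow rely on Cook's distances. As a minimal sketch, the code below uses a common one-step approximation based on standardized Pearson residuals and leverages (this particular approximation is an assumption, not a formula taken from the slides) and then refits the model without the most influential case to compare the coefficient estimates; all data are simulated.

```python
# Minimal sketch: case deletion via Cook's distance for a binary logit model
# (illustrative data; the one-step approximation is an assumption).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 150
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 + 1.0 * x))))

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
pi_hat, p = fit.fittedvalues, X.shape[1]

# Leverages from the generalized hat matrix (see the sketch in the previous section).
w = pi_hat * (1.0 - pi_hat)
Xw = np.sqrt(w)[:, None] * X
h = np.diag(Xw @ np.linalg.solve(Xw.T @ Xw, Xw.T))

# One-step approximation: D_i ~ r_std_i^2 * h_ii / (p * (1 - h_ii)),
# with r_std_i the standardized Pearson residual.
r_p = (y - pi_hat) / np.sqrt(w)
r_std = r_p / np.sqrt(1.0 - h)
cooks = r_std**2 * h / (p * (1.0 - h))
worst = int(np.argmax(cooks))

# Exact case deletion: refit without the most influential observation and
# compare the coefficient estimates (the kind of comparison made in Tables 4.4/4.5).
keep = np.arange(n) != worst
fit_del = sm.GLM(y[keep], X[keep], family=sm.families.Binomial()).fit()
print("full fit:          ", np.round(fit.params, 3))
print("case", worst, "deleted:", np.round(fit_del.params, 3))
```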

Example 4.5: Unemployment

Cook's distances for the unemployment data show that observations 33, 38, and 44, which correspond to ages 48, 53, and 59, are influential. All three observations are rather far from the fit.

Example 4.6: Exposure to Dust (Non-Smokers)

Observed covariates:
- mean dust concentration at the working place in mg/m³ (dust)
- duration of exposure in years (years)
- smoking (1: yes; 0: no)
Binary response:
- bronchitis (1: present; 0: not present)
Sample size: n = 1246

Example 4.6: Exposure to Dust (Non-Smokers)

Table 4.4 shows the estimated coefficients for the main-effects model. Table 4.5 shows the fit without the observation (15.04, 27). It can be seen that the coefficient for the concentration of dust has changed distinctly.

Example 4.6: Exposure to Dust (Non-Smokers)

As seen in Figure 4.7, large values of Cook's distance are found for the following observations, with their respective (dust, years) values: 730 (1.63, 8); 1175 (8, 32); 1210 (8, 13). All three observations correspond to persons with bronchitis. They are NOT extreme in the range of years, which is the influential variable. The variable dust shows no significant effect, and therefore it is only a consequence that the Cook's distance is small.

Example 4.7: Exposure to Dust

In this example the full exposure dataset (smokers and non-smokers) is used. For the full dataset:
- concentration of dust, years of exposure, and smoking all show significant effects;
- one observation is positioned very extremely in the observation space!

When the extreme value is excluded:
- the coefficient estimates for the variables years and smoking are similar to the estimates for the full dataset;
- the coefficient for the variable concentration of dust differs by about 8%.
Since observation 1246 is far away from the rest of the data and mean exposure is a variable that is not easy to measure, it should be considered an outlier and omitted.

Thank you for your attention!