Multiple Regression III (4/16/12): more on categorical variables, missing data, variable selection, stepwise regression, confounding variables. Not in book. Professor Kari Lock Morgan, Duke University.

To Do: Project 2 Presentation (Thursday, 4/19); Project 2 Paper (Wednesday, 4/25).

Categorical Variables: Models are estimated better with fewer coefficients to estimate. A categorical variable requires estimating a coefficient for each category (minus one). Unless you have a very large sample size, refrain from including categorical variables with many categories (or else group the categories).

Categorical Variables: Sometimes categorical variables are coded with numbers. If this is the case, R will interpret the variable as quantitative, not categorical. To make sure it is treated as categorical, BEFORE attaching the data use dataname$variablename = as.factor(dataname$variablename).
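A minimal sketch of the difference this makes, using a made-up data frame (the names salary, pay, and region are illustrative, not from the course data):

  # Hypothetical data: region happens to be coded with the numbers 1-4
  salary <- data.frame(pay = c(52, 48, 61, 55, 70, 66),
                       region = c(1, 2, 3, 1, 4, 2))

  # Left as numeric, R estimates a single slope for region
  summary(lm(pay ~ region, data = salary))

  # Converted to a factor, R estimates a coefficient for each region
  # except the reference category (here, three coefficients)
  salary$region <- as.factor(salary$region)
  summary(lm(pay ~ region, data = salary))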

Missing Data: If a case has missing data for any of the explanatory variables, it will be left out of the regression. If there is lots of missing data, your sample size could be greatly reduced. Consider leaving out variables with lots of missing data, especially if your sample size is small to begin with.

Missing Data: If the proportion of cases with missing data is small, then you do not have much to worry about. If there are lots of missing values, you could get a completely wrong answer by just leaving those cases out. Simply ignoring missing data can be very dangerous, but dealing with it appropriately requires another course in statistics. In life after STAT 101, if you want to analyze a dataset with lots of missing data, consult a statistician.
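A quick way to see how much missing data you have, and how many cases a fitted model actually used (a sketch assuming a hypothetical data frame dat with response y and explanatory variables x1, x2, x3):

  # Count missing values in each variable
  colSums(is.na(dat))

  # lm() silently drops any case with a missing value in the response or
  # in any explanatory variable (na.action = na.omit by default)
  fit <- lm(y ~ x1 + x2 + x3, data = dat)
  nobs(fit)   # number of cases actually used in the regression
  nrow(dat)   # number of cases in the original data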

Smokers

If smoking were banned in a state, the percentage of smokers would most likely decrease. In that case, the percentage voting Democratic would… (a) increase (b) decrease (c) impossible to tell

Causation: A significant explanatory variable in a regression model indicates association, but not necessarily causation. CAUSALITY CAN ONLY BE INFERRED FROM A RANDOMIZED EXPERIMENT!!!!

Causation

Variable Selection: The p-value for an explanatory variable can be taken as a rough measure of how helpful that explanatory variable is to the model. Insignificant variables may be pruned from the model. You can also look at relationships between explanatory variables; if two are strongly associated, perhaps both are not necessary.

Variable Selection: (Some) ways of deciding whether a variable should be included in the model:
1. Does it improve adjusted R²?
2. Does it have a low p-value?
3. Is it associated with the response by itself?
4. Is it strongly associated with another explanatory variable? (If yes, then including both may be redundant.)
5. Does common sense say it should contribute to the model?
What would you eliminate from the model? (handout)
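In R, the first four checks might look something like this (a sketch assuming a hypothetical data frame dat with response y and candidate explanatory variables x1, x2, x3, where x1 is the variable in question):

  # Fit the model with and without x1
  full    <- lm(y ~ x1 + x2 + x3, data = dat)
  reduced <- lm(y ~ x2 + x3, data = dat)

  summary(full)$adj.r.squared     # 1. adjusted R-squared with x1 ...
  summary(reduced)$adj.r.squared  #    ... and without it
  summary(full)$coefficients      # 2. x1's p-value
  cor(dat$y, dat$x1)              # 3. association with the response by itself
  cor(dat$x1, dat$x2)             # 4. association with another explanatory variable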

Stepwise Regression: We could go through and think hard about which variables to include, or we could automate the process. Stepwise regression drops insignificant variables one by one. This is particularly useful if you have many potential explanatory variables.
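One way to automate this in R is with step(), which performs backward elimination automatically; note that step() drops variables using AIC rather than p-values, so it will not necessarily reproduce the pruned models on the following slides (the variable names below are illustrative):

  # Start from the full model and let R drop variables one at a time
  full   <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat)
  pruned <- step(full, direction = "backward")
  summary(pruned)

  # To prune by p-value instead, refit by hand, dropping the least
  # significant variable at each step, e.g. update(full, . ~ . - x4)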

Full Model

Pruned Model 1

Pruned Model 2

Pruned Model 3

Pruned Model 4

Pruned Model 5

FINAL STEPWISE MODEL

Project 2: For Project 2, I recommend doing variable selection both ways:
1) manually choose which variables to include based on p-values, pairwise relationships, and common sense
2) use stepwise regression
and then choose between these two models.

Electricity and Life Expectancy
Cases: countries of the world
Response variable: life expectancy
Explanatory variable: electricity use (kWh per capita)
Is a country's electricity use helpful in predicting life expectancy?
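Fitting this simple linear regression in R might look like the following (a sketch; the data frame countries and the column names LifeExpectancy and Electricity are illustrative, not necessarily the names in the course data):

  # Scatterplot with the least squares line
  plot(LifeExpectancy ~ Electricity, data = countries)
  fit <- lm(LifeExpectancy ~ Electricity, data = countries)
  abline(fit)
  summary(fit)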

Electricity and Life Expectancy

Outlier: Iceland

Electricity and Life Expectancy

Is this a good model for predicting life expectancy based on electricity use? (a) Yes (b) No

Electricity and Life Expectancy: Is a country's electricity use helpful in predicting life expectancy? (a) Yes (b) No

Electricity and Life Expectancy: If we increased electricity use in a country, would life expectancy increase? (a) Yes (b) No (c) Impossible to tell

Confounding Variables: Wealth is an obvious confounding variable that could explain the relationship between electricity use and life expectancy. Multiple regression is a powerful tool that allows us to account for confounding variables. We can see whether an explanatory variable is still significant, even after including potential confounding variables in the model.
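Concretely, accounting for the confounder means adding it to the model and seeing whether the original explanatory variable remains significant (again a sketch with illustrative names):

  # Before and after adjusting for GDP
  summary(lm(LifeExpectancy ~ Electricity, data = countries))
  summary(lm(LifeExpectancy ~ Electricity + GDP, data = countries))
  # If Electricity's p-value is no longer small once GDP is in the model,
  # the wealth of the country can account for the original association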

Electricity and Life Expectancy: Is a country's electricity use helpful in predicting life expectancy, even after including GDP in the model? (a) Yes (b) No. Once GDP is accounted for, electricity use is no longer a significant predictor of life expectancy.

Which is the “best” model? (a) (b) (c)

Cell Phones and Life Expectancy
Cases: countries of the world
Response variable: life expectancy
Explanatory variable: number of mobile cellular subscriptions per 100 people
Is a country's cell phone subscription rate helpful in predicting life expectancy?

Cell Phones and Life Expectancy: Is this a good model for predicting life expectancy based on cell phone subscriptions? (a) Yes (b) No

Cell Phones and Life Expectancy: Is a country's number of cell phone subscriptions per capita helpful in predicting life expectancy? (a) Yes (b) No

Cell Phones and Life Expectancy: If we gave everyone in a country a cell phone and a cell phone subscription, would life expectancy in that country increase? (a) Yes (b) No (c) Impossible to tell

Cell Phones and Life Expectancy: Is a country's cell phone subscription rate helpful in predicting life expectancy, even after including GDP in the model? (a) Yes (b) No

This says that wealth alone cannot explain the association between cell phone subscriptions and life expectancy. This suggests that either cell phones actually do something to increase life expectancy (causal), OR there is another confounding variable besides the wealth of the country.

Confounding Variables: Multiple regression is one potential way to account for confounding variables. This approach is commonly used in practice across a wide variety of fields, but it is quite sensitive to the conditions for the linear model (particularly linearity). You can only "rule out" confounding variables that you have data on, so it is still very hard to make true causal conclusions without a randomized experiment.