Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan Multiple Regression SECTION 10.3 Categorical variables Variable selection.

Slides:



Advertisements
Similar presentations
Statistics and Quantitative Analysis U4320
Advertisements

Irwin/McGraw-Hill © Andrew F. Siegel, 1997 and l Chapter 12 l Multiple Regression: Predicting One Factor from Several Others.
Lecture 28 Categorical variables: –Review of slides from lecture 27 (reprint of lecture 27 categorical variables slides with typos corrected) –Practice.
Multiple Regression II 4/11/12 Categorical explanatory variables Adjusted R 2 Not in book Professor Kari Lock Morgan Duke University.
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTION 2.6, 9.1 Least squares line Interpreting.
Introduction to Statistics: Political Science (Class 9) Review.
Regression With Categorical Variables. Overview Regression with Categorical Predictors Logistic Regression.
Stat 112: Lecture 10 Notes Fitting Curvilinear Relationships –Polynomial Regression (Ch ) –Transformations (Ch ) Schedule: –Homework.
Statistics for Managers Using Microsoft® Excel 5th Edition
© 2003 Prentice-Hall, Inc.Chap 14-1 Basic Business Statistics (9 th Edition) Chapter 14 Introduction to Multiple Regression.
Basic Business Statistics, 11e © 2009 Prentice-Hall, Inc. Chap 14-1 Chapter 14 Introduction to Multiple Regression Basic Business Statistics 11 th Edition.
Lecture 24: Thurs., April 8th
Multiple Regression 2 Sociology 5811 Lecture 23 Copyright © 2005 by Evan Schofer Do not copy or distribute without permission.
Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor.
Simple Linear Regression Least squares line Interpreting coefficients Prediction Cautions The formal model Section 2.6, 9.1, 9.2 Professor Kari Lock Morgan.
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTIONS 9.3 Confidence and prediction intervals.
Review for Final Exam Some important themes from Chapters 9-11 Final exam covers these chapters, but implicitly tests the entire course, because we use.
Multiple Linear Regression A method for analyzing the effects of several predictor variables concurrently. - Simultaneously - Stepwise Minimizing the squared.
Forecasting Revenue: An Example of Regression Model Building Setting: Possibly a large set of predictor variables used to predict future quarterly revenues.
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan Probability SECTION 11.1 Events and, or, if Disjoint Independent Law of total.
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 13-1 Chapter 13 Introduction to Multiple Regression Statistics for Managers.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 12: Multiple and Logistic Regression Marshall University.
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan 11/27/12 Multiple Regression SECTION 10.3 Categorical variables Variable.
Synthesis and Review 3/26/12 Multiple Comparisons Review of Concepts Review of Methods - Prezi Essential Synthesis 3 Professor Kari Lock Morgan Duke University.
This Week: Testing relationships between two metric variables: Correlation Testing relationships between two nominal variables: Chi-Squared.
3 CHAPTER Cost Behavior 3-1.
Chapter 14 Introduction to Multiple Regression Sections 1, 2, 3, 4, 6.
STAT 250 Dr. Kari Lock Morgan
Examining Relationships Prob. And Stat. CH.2.1 Scatterplots.
Modeling Possibilities
September In Chapter 14: 14.1 Data 14.2 Scatterplots 14.3 Correlation 14.4 Regression.
Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan Multiple Regression SECTIONS 10.1, 10.3 (?) Multiple explanatory variables.
1 1 Slide © 2016 Cengage Learning. All Rights Reserved. The equation that describes how the dependent variable y is related to the independent variables.
1 1 Slide © 2007 Thomson South-Western. All Rights Reserved OPIM 303-Lecture #9 Jose M. Cruz Assistant Professor.
1 1 Slide Multiple Regression n Multiple Regression Model n Least Squares Method n Multiple Coefficient of Determination n Model Assumptions n Testing.
Chapter 14 Introduction to Multiple Regression
CHAPTER 14 MULTIPLE REGRESSION
Correlation and Linear Regression. Evaluating Relations Between Interval Level Variables Up to now you have learned to evaluate differences between the.
Multiple regression - Inference for multiple regression - A case study IPS chapters 11.1 and 11.2 © 2006 W.H. Freeman and Company.
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan Multiple Regression SECTIONS 9.2, 10.1, 10.2 Multiple explanatory variables.
Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan Simple Linear Regression SECTION 9.1 Inference for correlation Inference for.
Multiple Regression I 4/9/12 Transformations The model Individual coefficients R 2 ANOVA for regression Residual standard error Section 9.4, 9.5 Professor.
Multiple Regression BPS chapter 28 © 2006 W.H. Freeman and Company.
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan 11/6/12 Simple Linear Regression SECTIONS 9.1, 9.3 Inference for slope (9.1)
1 Regression Analysis The contents in this chapter are from Chapters of the textbook. The cntry15.sav data will be used. The data collected 15 countries’
1 Quadratic Model In order to account for curvature in the relationship between an explanatory and a response variable, one often adds the square of the.
Statistics: Unlocking the Power of Data Lock 5 Exam 2 Review STAT 101 Dr. Kari Lock Morgan 11/13/12 Review of Chapters 5-9.
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan Multiple Regression SECTION 10.3 Variable selection Confounding variables.
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan 12/6/12 Synthesis Big Picture Essential Synthesis Bayesian Inference (continued)
Lecture 12 Preview: Model Specification and Model Development Model Specification: Ramsey REgression Specification Error Test (RESET) RESET Logic Model.
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 14-1 Chapter 14 Multiple Regression Model Building Statistics for Managers.
Dummy Variables; Multiple Regression July 21, 2008 Ivan Katchanovski, Ph.D. POL 242Y-Y.
Regression Analysis: Part 2 Inference Dummies / Interactions Multicollinearity / Heteroscedasticity Residual Analysis / Outliers.
Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice- Hall, Inc. Chap 14-1 Business Statistics: A Decision-Making Approach 6 th Edition.
Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan Synthesis Big Picture Essential Synthesis Synthesis and Review.
Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan Simple Linear Regression SECTION 2.6 Least squares line Interpreting coefficients.
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan 11/20/12 Multiple Regression SECTIONS 9.2, 10.1, 10.2 Multiple explanatory.
Basic Business Statistics, 10e © 2006 Prentice-Hall, Inc.. Chap 14-1 Chapter 14 Introduction to Multiple Regression Basic Business Statistics 10 th Edition.
Introduction to Multiple Regression Lecture 11. The Multiple Regression Model Idea: Examine the linear relationship between 1 dependent (Y) & 2 or more.
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 14-1 Chapter 14 Multiple Regression Model Building Statistics for Managers.
Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan Multiple Regression SECTIONS 10.1, 10.3 Multiple explanatory variables (10.1,
Multiple Regression Learning Objectives n Explain the Linear Multiple Regression Model n Interpret Linear Multiple Regression Computer Output n Test.
Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan 11/6/12 Simple Linear Regression SECTION 2.6 Interpreting coefficients Prediction.
Statistics: Unlocking the Power of Data Lock 5 STAT 250 Dr. Kari Lock Morgan Simple Linear Regression SECTION 9.1 Inference for correlation Inference for.
Stats Methods at IC Lecture 3: Regression.
Chapter 14 Introduction to Multiple Regression
Multiple Regression Analysis and Model Building
STAT 250 Dr. Kari Lock Morgan
Regression Analysis.
Regression and Categorical Predictors
Presentation transcript:

Statistics: Unlocking the Power of Data Lock 5 STAT 101 Dr. Kari Lock Morgan Multiple Regression SECTION 10.3 Categorical variables Variable selection Confounding variables revisited

Statistics: Unlocking the Power of Data Lock 5 Extra Credit Each of you has the option to earn up to 10 extra credit points Options herehere

Statistics: Unlocking the Power of Data Lock 5 US States We will build a model to predict the % of the state that voted for Obama (out of the two party vote) in the 2012 US presidential election, using the 50 states as cases This can help us to understand how certain features of a state are associated with political beliefs

Statistics: Unlocking the Power of Data Lock 5 A regression where the cases are states, the response variable is % vote for Obama in 2012 election (ObamaPer), and the explanatory variable is region of the country (Region) gives R 2 = Which of the following is true? (a) The correlation between ObamaPer and Region is 0.36 (b) 36% of the variability in ObamaPer is explained by Region (c) The correlation between ObamaPer and Region is √0.36 (d) √36% of the variability in ObamaPer is explained by Region Interpreting R 2

Statistics: Unlocking the Power of Data Lock 5 Categorical Variables For this to make any sense, each x value has to be a number. How do we include categorical variables in a regression setting?

Statistics: Unlocking the Power of Data Lock 5 Categorical Variables Take one categorical variable, and replace it with several “dummy” variables A dummy variable is 1 if the case falls into the category represented by the dummy variable, and 0 otherwise Create one dummy variable for each category of the categorical variable

Statistics: Unlocking the Power of Data Lock 5 Dummy Variables StateRegionSouthWestNortheastMidwest AlabamaSouth1000 AlaskaWest0100 ArkansasSouth1000 CaliforniaWest0100 ColoradoWest0100 ConnecticutNortheast0010 DelawareNortheast0010 FloridaSouth1000 GeorgiaSouth1000 HawaiiWest0100 ………………

Statistics: Unlocking the Power of Data Lock 5 Dummy Variables When using dummy variables, one has to be left out of the model The dummy variable left out is called the reference level When using region of the country (Northeast, South, Midwest, West) to predict % Obama vote, how many dummy variables will be included? a)Oneb) Twoc) Three d) Four

Statistics: Unlocking the Power of Data Lock 5 Dummy Variables Predicting % vote for Obama with one categorical variable: region of the country If “midwest” is the reference level: Predicted percentage vote for midwest state Increase in vote for a West state, compared to a Midwest state

Statistics: Unlocking the Power of Data Lock 5 Voting by Region Based on the output above, which region had the highest percent vote for Obama? a)Midwest b)Northeast c)South d)West

Statistics: Unlocking the Power of Data Lock 5 Voting by Region What is the predicted % Obama vote for a state in the northeast? a)13% b)47% c)55% d)60%

Statistics: Unlocking the Power of Data Lock 5 Voting by Region What is the predicted % Obama vote for a state in the midwest? a)50% b)47% c)0% d)45%

Statistics: Unlocking the Power of Data Lock 5 Categorical Variables The p-value for each dummy variable tests for a significant difference between that category and the reference level For an overall p-value for the significance of the categorical variable with multiple categories, use a)z-test b)T-test c)Chi-square test d)ANOVA

Statistics: Unlocking the Power of Data Lock 5 Categorical Variables ANOVA for Regression: ANOVA for Difference in Means:

Statistics: Unlocking the Power of Data Lock 5 p-values Do p-values make sense to use here? a) Yes b) No

Statistics: Unlocking the Power of Data Lock 5 Categorical Variables in R R automatically creates dummy variables for you if you include a categorical explanatory variable The first level alphabetically is usually the reference level

Statistics: Unlocking the Power of Data Lock 5 Categorical Variables Either all dummy variables associated with a categorical variable have to be included in the model, or none of them RegionS and RegionW are not significant, but leaving them out would clump the South and the West with the reference level, Midwest, which does not make sense

Statistics: Unlocking the Power of Data Lock 5 Full Regression Model

Statistics: Unlocking the Power of Data Lock 5 West Region With only region as an explanatory variable, interpret the positive coefficient of RegionW. With all the other explanatory variables included, interpret the negative coefficient of RegionW. In this data set, states in the West voted more for Obama than states in the Midwest. States in the West voted less for Obama than would be expected based on the other variables in the model, as compared to states in the Midwest.

Statistics: Unlocking the Power of Data Lock 5 Smoking Given all the other variables in the model, states with a higher percentage of smokers are more likely to vote (a) Republican (b) Democratic (c) Impossible to tell

Statistics: Unlocking the Power of Data Lock 5 Smoking The correlation between percent of people smoking in a state and the percent of people voting for Obama in 2012 was (a) Positive (b) Negative (c) Impossible to tell

Statistics: Unlocking the Power of Data Lock 5 Smokers If smoking was banned in a state, the percentage of smokers would most likely decrease. In that case, the percentage voting Democratic would… (a) increase (b) decrease (c) impossible to tell

Statistics: Unlocking the Power of Data Lock 5 Goal of the Model? If the goal of the model is to see what and how each variable is associated with a state’s voting patterns, given all the other variables in the model, then we are done If the goal is to predict the % of the vote that will be for the democrat, say in the 2016 election, we want to prune out insignificant variables to improve the model

Statistics: Unlocking the Power of Data Lock 5 Over-fitting It is possible to over-fit a model: to include too many explanatory variables The fewer the coefficients being estimated, the better they will be estimated Usually, a good model has pruned out explanatory variables that are not helping

Statistics: Unlocking the Power of Data Lock 5 R2R2 Adding more explanatory variables will only make R 2 increase or stay the same Adding another explanatory variable can not make the model explain less, because the other variables are all still in the model Is the best model always the one with the highest proportion of variability explained, and so the highest R 2 ? (a) Yes(b) No

Statistics: Unlocking the Power of Data Lock 5 Adjusted R 2 Adjusted R 2 is like R 2, but takes into account the number of explanatory variables As the number of explanatory variables increases, adjusted R 2 gets smaller than R 2 One way to choose a model is to choose the model with the highest adjusted R 2

Statistics: Unlocking the Power of Data Lock 5 Adjusted R 2 You now know how to interpret all of these numbers!

Statistics: Unlocking the Power of Data Lock 5 Variable Selection The p-value for an explanatory variable can be taken as a rough measure for how helpful that explanatory variable is to the model Insignificant variables may be pruned from the model, as long as adjusted R 2 doesn’t decrease You can also look at relationships between explanatory variables; if two are strongly associated, perhaps both are not necessary

Statistics: Unlocking the Power of Data Lock 5 Variable Selection (Some) ways of deciding whether a variable should be included in the model or not: 1.Does it improve adjusted R 2 ? 2.Does it have a low p-value? 3.Is it associated with the response by itself? 4.Is it strongly associated with another explanatory variables? (If yes, then including both may be redundant) 5.Does common sense say it should contribute to the model?

Statistics: Unlocking the Power of Data Lock 5 Stepwise Regression We could go through and think hard about which variables to include, or we could automate the process Stepwise regression drops insignificant variables one by one This is particularly useful if you have many potential explanatory variables

Statistics: Unlocking the Power of Data Lock 5 Full Model Highest p-value

Statistics: Unlocking the Power of Data Lock 5 Pruned Model 1 Highest p-value

Statistics: Unlocking the Power of Data Lock 5 Pruned Model 2 Highest p-value

Statistics: Unlocking the Power of Data Lock 5 Pruned Model 3 Highest p-value

Statistics: Unlocking the Power of Data Lock 5 Pruned Model 4 Highest p-value

Statistics: Unlocking the Power of Data Lock 5 Pruned Model 5 Highest p-value

Statistics: Unlocking the Power of Data Lock 5 Pruned Model 6

Statistics: Unlocking the Power of Data Lock 5 Pruned Model 5

Statistics: Unlocking the Power of Data Lock 5 Pruned Model 7

Statistics: Unlocking the Power of Data Lock 5 Pruned Model 5 FINAL STEPWISE MODEL

Statistics: Unlocking the Power of Data Lock 5 Full Model

Statistics: Unlocking the Power of Data Lock 5 Variable Selection There is no one “best” model Choosing a model is just as much an art as a science Adjusted R 2 is just one possible criteria To learn much more about choosing the best model, take STAT 210

Statistics: Unlocking the Power of Data Lock 5 Electricity and Life Expectancy Cases: countries of the world Response variable: life expectancy Explanatory variable: electricity use (kWh per capita) Is a country’s electricity use helpful in predicting life expectancy?

Statistics: Unlocking the Power of Data Lock 5 Electricity and Life Expectancy

Statistics: Unlocking the Power of Data Lock 5 Electricity and Life Expectancy Outlier: Iceland

Statistics: Unlocking the Power of Data Lock 5 Electricity and Life Expectancy

Statistics: Unlocking the Power of Data Lock 5 Electricity and Life Expectancy Is this a good model for predicting life expectancy based on electricity use? (a) Yes (b) No

Statistics: Unlocking the Power of Data Lock 5 Electricity and Life Expectancy Is a country’s electricity use helpful in predicting life expectancy? (a) Yes (b) No

Statistics: Unlocking the Power of Data Lock 5 Electricity and Life Expectancy

Statistics: Unlocking the Power of Data Lock 5 Electricity and Life Expectancy If we increased electricity use in a country, would life expectancy increase? (a) Yes (b) No (c) Impossible to tell

Statistics: Unlocking the Power of Data Lock 5 Electricity and Life Expectancy If we increased electricity use in a country, would life expectancy increase? (a) Yes (b) No (c) Impossible to tell

Statistics: Unlocking the Power of Data Lock 5 Confounding Variables Wealth is an obvious confounding variable that could explain the relationship between electricity use and life expectancy Multiple regression is a powerful tool that allows us to account for confounding variables We can see whether an explanatory variable is still significant, even after including potential confounding variables in the model

Statistics: Unlocking the Power of Data Lock 5 Electricity and Life Expectancy Is a country’s electricity use helpful in predicting life expectancy, even after including GDP in the model? (a) Yes(b) No

Statistics: Unlocking the Power of Data Lock 5 Which is the “best” model? (a) (b) (c)

Statistics: Unlocking the Power of Data Lock 5 Cases: countries of the world Response variable: life expectancy Explanatory variable: number of mobile cellular subscriptions per 100 people Is a country’s cell phone subscription rate helpful in predicting life expectancy? Cell Phones and Life Expectancy

Statistics: Unlocking the Power of Data Lock 5 Cell Phones and Life Expectancy

Statistics: Unlocking the Power of Data Lock 5 Cell Phones and Life Expectancy

Statistics: Unlocking the Power of Data Lock 5 Cell Phones and Life Expectancy

Statistics: Unlocking the Power of Data Lock 5 Is this a good model for predicting life expectancy based on cell phone subscriptions? (a) Yes (b) No Cell Phones and Life Expectancy

Statistics: Unlocking the Power of Data Lock 5 Is a country’s number of cell phone subscriptions per capita helpful in predicting life expectancy? (a) Yes (b) No Cell Phones and Life Expectancy

Statistics: Unlocking the Power of Data Lock 5 If we gave everyone in a country a cell phone and a cell phone subscription, would life expectancy in that country increase? (a) Yes (b) No (c) Impossible to tell Cell Phones and Life Expectancy

Statistics: Unlocking the Power of Data Lock 5 Is a country’s cell phone subscription rate helpful in predicting life expectancy, even after including GDP in the model? (a) Yes(b) No Cell Phones and Life Expectancy

Statistics: Unlocking the Power of Data Lock 5 Cell Phones and Life Expectancy This says that wealth alone can not explain the association between cell phone subscriptions and life expectancy This suggests that either cell phones actually do something to increase life expectancy (causal) OR there is another confounding variable besides wealth of the country

Statistics: Unlocking the Power of Data Lock 5 Confounding Variables Multiple regression is one potential way to account for confounding variables This is most commonly used in practice across a wide variety of fields, but is quite sensitive to the conditions for the linear model (particularly linearity) You can only “rule out” confounding variables that you have data on, so it is still very hard to make true causal conclusions without a randomized experiment

Statistics: Unlocking the Power of Data Lock 5 To Do Read 10.3 Do Homework 8 (due Wednesday, 4/16) Do Project 2 (due Wednesday, 4/23)