Multiple Regression II (4/11/12): Categorical explanatory variables, Adjusted R². Not in book. Professor Kari Lock Morgan, Duke University.

Project 1 Regrade Requests
Regrade requests must be submitted in writing, with the original project, by Friday, 4/13/12, at 5pm. If regraded, I will grade the entire project, and the grade may go up or down.

To Do
Project 2 Proposal (due TODAY, 5pm)
Homework 9 (due Monday, 4/16)
Project 2 Presentation (Thursday, 4/19)
Project 2 Paper (Wednesday, 4/25)

Chapter 9 Odd Solutions
Chapter 9 odd solutions are now available.

Lab
Come to lab tomorrow! Your group members are (mostly) in the same lab, so this is a great time to meet with your group in person. There is no "On Your Own," and I've tried to make the lab short, so you should have some time with your group to get started on your project. If you work through the lab on your own before coming to class, you can have the whole lab time with your group to work on the project.

US States
We will build a model to predict the % of each state that voted Republican in the 2008 US presidential election, using the 50 states as cases. Sample? Population? This can help us understand how certain features of a state are associated with political beliefs.

US States
Response variable: the % of the state's vote that went Republican (McCain) in 2008. Our first explanatory variable is region of the country: Midwest, Northeast, South, or West.

Categorical Variables
How do we include categorical variables in a regression setting? For the model to make any sense, each x value has to be a number.

Categorical Variables
Take one categorical variable and replace it with several "dummy" variables, one for each category of the categorical variable. A dummy variable is 1 if the case falls into the category it represents, and 0 otherwise.

Dummy Variables

State        Region     South  West  Northeast  Midwest
Alabama      South      1      0     0          0
Alaska       West       0      1     0          0
Arkansas     South      1      0     0          0
California   West       0      1     0          0
Colorado     West       0      1     0          0
Connecticut  Northeast  0      0     1          0
Delaware     Northeast  0      0     1          0
Florida      South      1      0     0          0
Georgia      South      1      0     0          0
Hawaii       West       0      1     0          0
…            …          …      …     …          …
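A minimal sketch of building such a table, here in Python with pandas (the course uses R, where this happens automatically; this is just an illustration of the coding):

```python
import pandas as pd

# A few of the states from the table above
states = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Connecticut"],
    "Region": ["South", "West", "Northeast"],
})

# One 0/1 "dummy" column per category of Region
dummies = pd.get_dummies(states["Region"], dtype=int)
print(pd.concat([states, dummies], axis=1))
```

Each row has exactly one 1 across the dummy columns: the category that case falls into.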

Dummy Variables
When using dummy variables, one has to be left out of the model. The dummy variable left out is called the reference level. When using region of the country (Northeast, South, Midwest, West) to predict % McCain vote, how many dummy variables will be included?
a) One  b) Two  c) Three  d) Four

Dummy Variables
Predicting % vote for McCain with one categorical variable, region of the country. If "Midwest" is the reference level:
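The regression output on the original slide did not survive the transcript, but with Midwest as the reference level the fitted model has the general form (the coefficient symbols here are placeholders, not the slide's values):

```latex
\widehat{\text{McCainVote}} = b_0 + b_1\,\text{South} + b_2\,\text{Northeast} + b_3\,\text{West}
```

The intercept b₀ is the prediction for the reference level (Midwest); each other coefficient is that region's estimated difference from Midwest.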

Voting by Region
Based on the output above, which region had the highest percent vote for McCain?
a) Midwest  b) Northeast  c) South  d) West

Voting by Region
What is the predicted % Republican vote for a state in the Northeast?
a) –10.2%  b) 48.6%  c) 38.4%  d) 58.8%

Voting by Region
What is the predicted % Republican vote for a state in the Midwest?
a) 50%  b) 48.6%  c) 0%  d) 58.8%
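The mechanics of such predictions can be sketched as follows; the coefficient values below are hypothetical placeholders for illustration, not the actual regression output:

```python
# Hypothetical coefficients for illustration only; Midwest is the
# reference level, so it has no dummy coefficient of its own.
coefs = {"Intercept": 48.6, "South": 10.2, "West": -1.0, "Northeast": -10.2}

def predict(region):
    """Predicted % Republican vote: the intercept plus the coefficient
    of the region's dummy variable (0 for the reference level)."""
    return coefs["Intercept"] + coefs.get(region, 0.0)

print(predict("Midwest"))    # reference level: the intercept alone
print(predict("Northeast"))  # intercept + Northeast coefficient
```

Setting a region's dummy to 1 simply adds that one coefficient to the intercept; for the reference level, every dummy is 0.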

Categorical Variables
The p-value for each dummy variable tests for a significant difference between that category and the reference level. For an overall p-value for the significance of a categorical variable with multiple categories, use:
a) z-test  b) t-test  c) chi-square test  d) ANOVA
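The overall test compares the group means across categories with an ANOVA F-statistic. A minimal sketch of that computation on toy data (the vote values below are made up for illustration):

```python
# Toy % Republican vote values by region (made-up numbers)
groups = {
    "Midwest":   [49.0, 47.5, 50.1],
    "Northeast": [38.0, 39.5, 37.3],
    "South":     [59.2, 58.1, 60.0],
}

values = [v for g in groups.values() for v in g]
grand_mean = sum(values) / len(values)

# Between-group sum of squares: how far group means are from the grand mean
ss_between = sum(len(g) * ((sum(g) / len(g)) - grand_mean) ** 2
                 for g in groups.values())
# Within-group sum of squares: spread of cases around their own group mean
ss_within = sum((v - sum(g) / len(g)) ** 2
                for g in groups.values() for v in g)

df_between = len(groups) - 1            # k - 1 groups
df_within = len(values) - len(groups)   # n - k cases

F = (ss_between / df_between) / (ss_within / df_within)
print(round(F, 1))  # large F => region means differ more than chance predicts
```

A large F (between-group variability dwarfing within-group variability) gives a small overall p-value for the categorical variable.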

Categorical Variables

Categorical Variables in R
R automatically creates dummy variables for you if you include a categorical explanatory variable. The reference level is usually the first level alphabetically. If you want to change the reference level, see me.
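In R this reordering is done with `relevel`; an analogous sketch in Python with pandas (illustration only, not the course's R workflow):

```python
import pandas as pd

regions = pd.Series(["South", "West", "Midwest", "Northeast"])

# Put Midwest first instead of the alphabetical default
cat = pd.Categorical(
    regions, categories=["Midwest", "Northeast", "South", "West"])
dummies = pd.get_dummies(cat, dtype=int)

# Dropping the first dummy column makes Midwest the reference level
model_columns = dummies.iloc[:, 1:].columns.tolist()
print(model_columns)
```

Whichever category's dummy is left out of the model becomes the reference level that all other coefficients are compared against.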

Categorical Variables
Either all of the dummy variables associated with a categorical variable must be included in the model, or none of them. RegionW is not significant, but leaving it out would lump West in with the reference level, Midwest, which does not make sense.

Variables Let’s include some more explanatory variables! What helps to predict % voting Republican?

Categorical Variables
Be careful not to include a categorical variable for which every case is its own category. Example: using "State" as an explanatory variable would be silly, even though R² = 1! If you want to know how each state voted, it would make more sense to look directly at McCainVote rather than fitting a model and giving each state its own coefficient.

Explanatory Variables
Also, be careful not to include explanatory variables that are essentially just another form of the response variable. For example, ObamaMcCain is "M" if the state went for McCain and "O" if it went for Obama. This is certainly associated with the % of people in the state who voted for McCain, but tells us nothing interesting.

Explanatory Variables
Models should be created either to learn about relationships between explanatory variables and the response, or for prediction. Make sure the explanatory variables you include do not contradict the point of the model.

Visualization
How would we visualize the association between region and % vote for McCain?
a) Scatterplot  b) Side-by-side boxplots  c) Bar chart  d) Pie chart  e) Mosaic plot

Side-by-Side Boxplots

Test for Association
How would we test for an association between region and % vote for McCain?
a) t-test for difference in means  b) test for a correlation  c) ANOVA  d) chi-square test  e) test for a difference in proportions

ANOVA

Visualization
All of the other potential explanatory variables are quantitative. How would we visualize the association between each of them and % vote for McCain?
a) Scatterplot  b) Side-by-side boxplots  c) Bar chart  d) Pie chart  e) Mosaic plot

What do you see?

Test for Association
How would we test the association between each of these variables and % vote for McCain?
a) t-test for difference in means  b) test for a correlation  c) ANOVA  d) chi-square test  e) test for a difference in proportions

Test for Correlation

Regression Model

Physical Activity Given all the other variables in the model, states with a higher percentage of physically active citizens are more likely to vote (a) Republican (b) Democratic

West Region
With only region as an explanatory variable, the negative coefficient of RegionW means that, in this data set, states in the West voted less Republican than states in the Midwest.
With all the other explanatory variables included, the positive coefficient of RegionW means that states in the West voted more Republican than would be expected based on the other variables in the model, as compared to states in the Midwest.

Goal of the Model?
If the goal of the model is to see what and how each variable is associated with a state's voting patterns, given all the other variables in the model, then we are done. If the goal is to predict the % of the vote that will be Republican, say in the 2012 election, we want to prune out insignificant variables to improve the model.

Over-fitting
It is possible to over-fit a model: to include too many explanatory variables. The fewer the coefficients being estimated, the better each will be estimated. Usually, a good model has pruned out explanatory variables that are not helping.

R²
Adding more explanatory variables will only make R² increase or stay the same: adding another explanatory variable cannot make the model explain less, because the other variables all remain in the model. Is the best model always the one with the highest proportion of variability explained, and so the highest R²?
(a) Yes  (b) No

Adjusted R²
Adjusted R² is like R², but takes into account the number of explanatory variables. As the number of explanatory variables increases, adjusted R² falls further below R². One way to choose a model is to choose the model with the highest adjusted R².
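The standard formula (not shown on the slide) penalizes R² by the number of explanatory variables k, given n cases: R²_adj = 1 − (1 − R²)(n − 1)/(n − k − 1). A quick sketch of the penalty in action:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with n cases and k explanatory
    variables: shrinks R^2 more as k grows."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With 50 states and R^2 = 0.80, more variables means a bigger penalty:
print(round(adjusted_r2(0.80, n=50, k=4), 3))
print(round(adjusted_r2(0.80, n=50, k=10), 3))
```

Unlike R², adjusted R² can go down when an added variable does not explain enough to justify its extra coefficient, which is what makes it usable for model comparison.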

Adjusted R²
You now know how to interpret all of these numbers!