2.4 Cautions about Correlation and Regression

Residuals (again!)
Recall our discussion about residuals: what is a residual? The idea behind the line of best fit was to “minimize” the residuals, or more precisely the sum of their squares. While this can always be done, when is the line of best fit actually a good fit (as opposed to some curve)?
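As a reminder, the definition can be written out in symbols. The notation here (a for the intercept, b for the slope, e_i for the i-th residual) is an assumption for illustration and is not taken from the slides.

```latex
e_i = y_i - \hat{y}_i, \qquad \hat{y}_i = a + b\,x_i, \qquad
(a, b) = \arg\min_{a,\,b} \sum_{i=1}^{n} \left( y_i - a - b\,x_i \right)^2
```

Each residual is an observed y minus the y predicted by the line, and the least-squares line is the choice of a and b that makes the sum of squared residuals as small as possible.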

Residual Plots
As we hinted at in 2.3, the mean of the least-squares residuals is always 0. This leads to the idea of a residual plot: a scatterplot of the residuals against the explanatory variable. Residual plots help us assess the fit of a regression line.
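The zero-mean property can be checked numerically. The following is a minimal sketch in plain Python; the data values and the helper names `least_squares_line` and `residuals` are hypothetical and chosen only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


def residuals(xs, ys):
    """Observed y minus predicted y for every data point."""
    slope, intercept = least_squares_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical, roughly linear data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

res = residuals(xs, ys)
mean_residual = sum(res) / len(res)

# A residual plot is a scatterplot of these (x, residual) pairs.
residual_plot_points = list(zip(xs, res))

print("mean residual:", mean_residual)  # zero up to floating-point rounding
```

The pairs in `residual_plot_points` are exactly what a residual plot displays: the explanatory variable on the horizontal axis and the residual on the vertical axis.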

We began Section 2.2 with this graph. What does the residual plot look like?

No overall pattern means the line is a good fit.
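For contrast, here is a small sketch of what a pattern looks like when a line is fitted to curved data. The data (y = x squared at x = 1, ..., 5) are made up for illustration, and the helper `fit_residuals` is a hypothetical name.

```python
def fit_residuals(xs, ys):
    """Fit the least-squares line and return the residuals (observed - predicted)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up curved data: y = x^2.
xs = [1, 2, 3, 4, 5]
curved_ys = [x ** 2 for x in xs]

curved_res = fit_residuals(xs, curved_ys)
print(curved_res)  # positive, negative, negative, negative, positive
```

The residuals are positive at both ends and negative in the middle. They still average 0, but the U shape is the kind of overall pattern that signals a curve would fit better than a line.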

We also looked at this graph in Section 2.3. Looks great!

But look at the residual plot. There seems to be a pattern at first.

Consider the following data. [Table of X and Y values not preserved.] The line of best fit is y = x. The points (2, 20) and (20, 2) seem problematic.

[Tables of X and Y values for the reduced data sets not preserved; each slide gives the line of best fit as y = x.]

[Residual plots, three panels: All data; Without (2, 20); Without (20, 2)]

What was the point of all that?
Eliminating a value with a large deviation in y made for the best residual plot, and it did not change the least-squares regression line much. Eliminating a value with a large deviation in x changed the least-squares regression line considerably. An observation is influential for a calculation if removing it would drastically change the result of the calculation. Usually an extreme outlier in the x direction is influential, while an extreme outlier in the y direction often is not. A related caution: a strong correlation may reflect causation, but by itself it does not establish it.
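The idea of influence can be sketched numerically. The points below are hypothetical (they are not the data from the slides): five points on the line y = x, one point with a large deviation in y, and one point far out in x. The helper `slope_of` is a name chosen for illustration.

```python
def slope_of(points):
    """Slope of the least-squares line through a list of (x, y) points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    return s_xy / s_xx


# Hypothetical data.
base = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
y_outlier = (3, 8)    # typical x, unusually large y
x_outlier = (15, 3)   # unusually large x

slope_all = slope_of(base + [y_outlier, x_outlier])
slope_without_y_outlier = slope_of(base + [x_outlier])
slope_without_x_outlier = slope_of(base + [y_outlier])

print("all points:       ", round(slope_all, 3))
print("without y outlier:", round(slope_without_y_outlier, 3))
print("without x outlier:", round(slope_without_x_outlier, 3))
```

In this made-up example, dropping the point that is extreme in y barely moves the slope, while dropping the point that is extreme in x changes it drastically. That is what it means for the second point to be influential.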

Lurking Variable
Ice cream sales and drowning deaths show a positive, and potentially statistically significant, correlation. Certainly there is no causal relationship here, despite what the data might suggest. There is a third variable “lurking” behind all of this, namely summertime. A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
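A small sketch can make the lurking-variable effect concrete. All numbers below are made up, temperature is used as a stand-in for "summertime", and the helper names are hypothetical. The data were constructed so that both variables depend on temperature but have nothing else in common.

```python
from math import sqrt


def pearson_r(us, vs):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    s_vv = sum((v - v_bar) ** 2 for v in vs)
    return s_uv / sqrt(s_uu * s_vv)


def residuals_on(xs, ys):
    """Residuals of ys after fitting the least-squares line on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up data: both series rise with temperature (the lurking variable).
temperature = [10, 15, 20, 25, 30, 35]
ice_cream_sales = [21, 29, 40, 50, 59, 71]
drownings = [5.5, 8.0, 9.0, 11.5, 15.5, 18.0]

raw_r = pearson_r(ice_cream_sales, drownings)

# Remove the part of each variable explained by temperature, then correlate what is left.
adjusted_r = pearson_r(residuals_on(temperature, ice_cream_sales),
                       residuals_on(temperature, drownings))

print("raw correlation:                   ", round(raw_r, 3))
print("after accounting for temperature:  ", round(adjusted_r, 3))
```

In this constructed example the raw correlation is very strong, but it vanishes once the common dependence on temperature is removed. Real data would rarely be this clean.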

2.5 Data Analysis for Two-Way Tables

Categorical Data
Scatterplots are good for quantitative data, but what if we have categorical data? For example, suppose we are interested in comparing the number of people in different kinds of colleges according to gender. We use a two-way table to present the data.

Status                        Men (in thousands)   Women (in thousands)
Two-year college, full-time
Two-year college, part-time
Four-year college, full-time
Four-year college, part-time
Graduate school
Vocation school
[Cell counts not preserved.]

It is easy to read off the data. We could have included a “total” row. This two-way table gives the actual counts; perhaps percentages would be more helpful. The following is a joint distribution.

Status                        Men   Women
Two-year college, full-time
Two-year college, part-time
Four-year college, full-time
Four-year college, part-time
Graduate school
Vocation school
[Cell proportions not preserved.]

This is a joint distribution. It is obtained by dividing each cell entry by the total sample size. Notice that the sum of all the values should be 1 (sometimes there is slight round-off error). If we are interested in the distribution of each variable on its own, we use a marginal distribution.

Status                        Men                 Women               Total (out of 10421)
Two-year college, full-time                                           18%
Two-year college, part-time                                           7%
Four-year college, full-time                                          60%
Four-year college, part-time                                          6%
Graduate school                                                       6%
Vocation school                                                       3%
Total                         4842/10421 = 46%    5579/10421 = 54%
[Individual cell counts not preserved.]

This is a marginal distribution. The marginal distribution of gender is given in the bottom row, and the marginal distribution of status is given in the rightmost column. Now a bar graph can help display the data. We’ll graph the status of those in college.
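The joint and marginal distributions can be sketched in a few lines of Python. The cell counts below are an assumption: they are not preserved in the transcript, and were chosen so that they reproduce the totals quoted on the slides (4842 men, 5579 women, 10421 overall) and the rounded percentages shown.

```python
# Assumed counts in thousands, as (men, women); chosen to match the stated totals.
counts = {
    "Two-year college, full-time": (890, 969),
    "Two-year college, part-time": (340, 403),
    "Four-year college, full-time": (2897, 3321),
    "Four-year college, part-time": (249, 383),
    "Graduate school": (306, 366),
    "Vocation school": (160, 137),
}

total = sum(men + women for men, women in counts.values())
men_total = sum(men for men, _ in counts.values())
women_total = sum(women for _, women in counts.values())

# Joint distribution: every cell divided by the overall total.
joint = {status: (men / total, women / total)
         for status, (men, women) in counts.items()}

# Marginal distributions: row totals and column totals divided by the overall total.
marginal_status = {status: (men + women) / total
                   for status, (men, women) in counts.items()}
marginal_gender = {"Men": men_total / total, "Women": women_total / total}

for status, share in marginal_status.items():
    print(f"{status:30s} {share:.0%}")
print(f"Men {marginal_gender['Men']:.0%}, Women {marginal_gender['Women']:.0%}")
```

The joint proportions sum to 1, and each marginal distribution sums to 1 on its own, which is a quick sanity check on any table built this way.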

Status                        Men   Women
Two-year college, full-time   48%   52%
Two-year college, part-time   46%   54%
Four-year college, full-time  47%   53%
Four-year college, part-time  39%   61%
Graduate school               46%   54%
Vocation school               54%   46%

This is a conditional distribution: we fix a status and look at the split by gender within it, so each row sums to 100%. We could also have fixed a gender and looked at the conditional distribution of status.
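A conditional distribution divides each cell by its own row total instead of the overall total. The sketch below uses assumed cell counts (not preserved in the transcript) that are consistent with the totals and percentages quoted on the slides.

```python
# Assumed counts in thousands, as (men, women); not taken from the transcript.
counts = {
    "Two-year college, full-time": (890, 969),
    "Two-year college, part-time": (340, 403),
    "Four-year college, full-time": (2897, 3321),
    "Four-year college, part-time": (249, 383),
    "Graduate school": (306, 366),
    "Vocation school": (160, 137),
}

# Conditional distribution of gender given status: divide by the row total.
gender_given_status = {
    status: (men / (men + women), women / (men + women))
    for status, (men, women) in counts.items()
}

for status, (p_men, p_women) in gender_given_status.items():
    print(f"{status:30s} men {p_men:.0%}  women {p_women:.0%}")
```

Dividing by column totals instead would give the conditional distribution of status for each gender.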

Simpson’s Paradox
Simpson’s paradox occurs when a comparison that holds within each of several groups is reversed when the groups are combined. Let’s look at an example.

Simpson’s Paradox (cont.)
               1995             1996             Combined
Dave Justice   104/411 = .253   45/140 = .321    149/551 = .270
Derek Jeter    12/48 = .250     183/582 = .314   195/630 = .310

Justice had a better batting average than Jeter in both 1995 and 1996, but Jeter had a better combined batting average than Justice. Why? Consider the following made-up table of batting averages.

Simpson’s Paradox (cont.)
               1995          1996            Combined
Dave Justice   1/2 = .500    1/1 = 1.000     2/3 = .667
Derek Jeter    0/1 = .000    99/100 = .990   100/101 = .990

Notice that the paradox can be attributed to the fact that Jeter’s number of at-bats in 1996 dominates.