Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 3 Association: Contingency, Correlation, and Regression Section 3.1 The Association.

Presentation transcript:

Section 3.1 The Association Between Two Categorical Variables

Response and Explanatory Variables
Response variable (dependent variable):
- The outcome variable on which comparisons are made.
Explanatory variable (independent variable):
- When the explanatory variable is categorical, it defines the groups to be compared with respect to values of the response variable.
- When the explanatory variable is quantitative, it defines the different numerical values at which values of the response variable are compared.
Examples (response / explanatory):
- Survival status / smoking status
- Carbon dioxide (CO2) level / amount of gasoline use for cars
- College GPA / number of hours a week spent studying

Association Between Two Variables
The main purpose of data analysis with two variables is to investigate whether there is an association and to describe that association. An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

Contingency Tables
A contingency table:
- Displays two categorical variables
- The rows list the categories of one variable
- The columns list the categories of the other variable
- Entries in the table are frequencies

Contingency Tables
Table 3.1 Frequencies for Food Type and Pesticide Status. The row totals and the column totals are the frequencies for the categories of each variable. The counts inside the table give information about the association.

Calculate Proportions and Conditional Proportions
Table 3.2 Conditional Proportions on Pesticide Status, for Two Food Types. These conditional proportions (using two decimal places) treat pesticide status as the response variable. The sample size n in a row shows the total on which the conditional proportions in that row were based.
These proportions are called conditional proportions because their formation is conditional on (in this example) food type.

Calculate Proportions and Conditional Proportions
Questions:
1. What proportion of organic foods contain pesticides?
2. What proportion of conventionally grown foods contain pesticides?
3. What proportion of all sampled items contain pesticides?
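The three questions above each ask for a count divided by a total. A minimal sketch of that arithmetic follows. The counts in `counts` are made up for illustration and are NOT the values from Table 3.1, and the function names are hypothetical.

```python
# Hypothetical counts for illustration only; NOT the values from Table 3.1.
counts = {
    "organic": {"present": 30, "not present": 70},
    "conventional": {"present": 150, "not present": 50},
}


def conditional_proportions(table):
    """Divide each count by its row total (proportions conditional on the row)."""
    result = {}
    for row, cells in table.items():
        n = sum(cells.values())  # sample size n for this row
        result[row] = {col: count / n for col, count in cells.items()}
    return result


def overall_proportion(table, column):
    """Proportion of all sampled items that fall in the given column."""
    total = sum(sum(cells.values()) for cells in table.values())
    in_column = sum(cells[column] for cells in table.values())
    return in_column / total


props = conditional_proportions(counts)
print(props["organic"]["present"])            # question 1: 30 / 100
print(props["conventional"]["present"])       # question 2: 150 / 200
print(overall_proportion(counts, "present"))  # question 3: 180 / 300
```

Questions 1 and 2 condition on food type (row totals), while question 3 uses the grand total, which is why its answer need not lie halfway between the other two.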

Calculate Proportions and Conditional Proportions
Side-by-side bar charts of conditional proportions make it easy to compare the categories of the explanatory variable with respect to the response variable.
Figure 3.2 Conditional Proportions on Pesticide Status, Given the Food Type. For a particular pesticide status category, the side-by-side bars compare the two food types.
Question: Comparing the bars, how would you describe the difference between organic and conventionally grown foods in the conditional proportion with pesticide residues present?

Calculate Proportions and Conditional Proportions
If there were no association between food type and pesticide status, then the proportions for the response variable categories would be the same for each food type.
Figure 3.3 Hypothetical Conditional Proportions on Pesticide Status, Given Food Type, Showing No Association.
Question: What's the difference between Figures 3.2 and 3.3 in the pattern shown by the bars in the graph?

Section 3.2 The Association Between Two Quantitative Variables

The Scatterplot: Looking for a Trend
Graphical display of the relationship between two quantitative variables:
- Horizontal axis: explanatory variable, x
- Vertical axis: response variable, y
Figure 3.5 MINITAB Scatterplot for Internet Use and Facebook Use for 33 Countries. The point for Japan is labeled and has coordinates x = 74 and y = 2.
Question: Is there any point that you would identify as standing out in some way? Which country does it represent, and how is it unusual in the context of these variables?

Example: Internet and Facebook Penetration Rates
Table 3.4 Internet and Facebook Penetration Rates for 33 Countries (shown across two slides).

Example: Internet and Facebook Penetration Rates
Using MINITAB, we obtain numerical measures of center and spread for Internet Use and Facebook Use: N, Mean, StDev, Minimum, Q1, Median, Q3, and Max. [MINITAB output values not preserved.]
Figure 3.4 MINITAB histograms of Internet use and Facebook use for the 33 countries.
Question: Which nations, if any, might be outliers in terms of Internet use? Facebook use? Which graphical display would more clearly identify potential outliers?

How to Examine a Scatterplot
We examine a scatterplot to study association. How do values on the response variable change as values of the explanatory variable change? You can describe the overall pattern of a scatterplot by the trend, direction, and strength of the relationship between the two variables.
- Trend: linear, curved, clusters, no pattern
- Direction: positive, negative, no direction
- Strength: how closely the points fit the trend
Also look for outliers from the overall trend.

Interpreting Scatterplots: Direction/Association
Two quantitative variables x and y are
- Positively associated when high values of x tend to occur with high values of y, and low values of x tend to occur with low values of y.
- Negatively associated when high values of one variable tend to pair with low values of the other variable.

Example: 100 cars on the lot of a used-car dealership
Question: Would you expect a positive association, a negative association, or no association between the age of the car and the mileage on the odometer?
- Positive association
- Negative association
- No association

Example: The Butterfly Ballot and the 2000 U.S. Presidential Election
In Figure 3.6, one point falls well above the others. This severe outlier is the observation for Palm Beach County, the county that had the butterfly ballot. It is far removed from the overall trend for the other 66 data points, which follow an approximately straight line.
Figure 3.6 MINITAB Scatterplot of Florida Countywide Vote for Reform Party Candidates Pat Buchanan in 2000 and Ross Perot in 1996.
Question: Why is the top point, but not each of the two rightmost points, considered an outlier relative to the overall trend of the data points?

Summarizing the Strength of Association: The Correlation, r
The correlation r measures the strength and direction of the linear association between x and y.
- A positive r value indicates a positive association.
- A negative r value indicates a negative association.
- An r value close to +1 or -1 indicates a strong linear association.
- An r value close to 0 indicates a weak linear association.

Correlation Coefficient: Measuring Strength and Direction of a Linear Relationship
Let's get a feel for the correlation r by looking at its values for the scatterplots shown in Figure 3.7.
Figure 3.7 Some Scatterplots and Their Correlations. The correlation gets closer to +1 or -1 when the data points fall closer to a straight line.
Question: Why are the cases in which the data points are closer to a straight line considered to represent stronger association?

Properties of Correlation
- Always falls between -1 and +1.
- The sign of the correlation denotes direction: (-) indicates a negative linear association and (+) indicates a positive linear association.
- Correlation is unit-less: it does not depend on the variables' units.
- Two variables have the same correlation no matter which is treated as the response variable.
- Correlation is not resistant to outliers.
- Correlation only measures the strength of a linear relationship.
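Several of these properties can be checked numerically. The sketch below computes r as the sum of products of z-scores divided by n - 1, then confirms that swapping x and y, or changing the units of x, leaves r unchanged. The data values and the function name `correlation` are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(x, y)
r_swapped = correlation(y, x)                      # roles of x and y exchanged
r_rescaled = correlation([100 * v for v in x], y)  # x measured in other units

print(r, r_swapped, r_rescaled)
```

All three printed values agree up to floating-point rounding, and each lies between -1 and +1.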

Calculating the Correlation Coefficient
Example: Per Capita Gross Domestic Product and Average Life Expectancy for Countries in Western Europe.
Table columns: Country, Per Capita GDP (x), Life Expectancy (y). Countries listed: Austria, Belgium, Finland, France, Germany, Ireland, Italy, Netherlands, Switzerland, United Kingdom. [Data values not preserved.]
Summary statistics: $s_x = 1.532$, $s_y = 0.795$.

Section 3.3 Predicting the Outcome of a Variable

Regression Line
The first step of a regression analysis is to identify the response and explanatory variables.
- We use y to denote the response variable.
- We use x to denote the explanatory variable.

Regression Line: An Equation for Predicting the Response Outcome
The regression line predicts the value for the response variable y as a straight-line function of the value x of the explanatory variable. Let $\hat{y}$ denote the predicted value of y. The equation for the regression line has the form $\hat{y} = a + bx$. In this formula, a denotes the y-intercept and b denotes the slope.

Example: Height Based on Human Remains
Regression equation: $\hat{y}$ is the predicted height and x is the length of a femur (thighbone), both measured in centimeters. [Fitted equation not preserved.]
Use the regression equation to predict the height of a person whose femur length was 50 centimeters.
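A prediction like the one asked for here is a single substitution into the regression equation. In the sketch below, the slope 2.4 comes from the slope slide in this deck, while the intercept 61.4 is an assumption (the value usually quoted for this textbook example; it is not preserved in this transcript). The function name is hypothetical.

```python
# Slope is stated elsewhere in this deck ("a 2.4 cm increase in predicted height").
SLOPE_CM_PER_CM = 2.4
# ASSUMPTION: the intercept is not preserved in the transcript; 61.4 is assumed here.
INTERCEPT_CM = 61.4


def predict_height(femur_cm):
    """Predicted height in cm from femur length in cm, using y-hat = a + b*x."""
    return INTERCEPT_CM + SLOPE_CM_PER_CM * femur_cm


predicted = predict_height(50)
print(round(predicted, 1))  # 61.4 + 2.4 * 50
```

Under the assumed intercept, a 50 cm femur gives a predicted height of about 181.4 cm.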

Interpreting the y-Intercept
y-Intercept:
- The predicted value for y when x = 0
- This fact helps in plotting the line
- May not have any interpretative value if no observations had x values near 0
It does not make sense for femur length to be 0 cm, so the y-intercept for the equation is not a relevant predicted height.

Interpreting the Slope
Slope: measures the change in the predicted value of the response variable (y) for a 1-unit increase in the explanatory variable (x).
Example: A 1 cm increase in femur length corresponds to a 2.4 cm increase in predicted height.

Slope Values: Positive, Negative, Equal to 0
Figure 3.12 Three Regression Lines Showing Positive Association (slope > 0), Negative Association (slope < 0) and No Association (slope = 0).
Question: Would you expect a positive or negative slope when y = annual income and x = number of years of education?

Residuals Measure the Size of Prediction Errors
Residuals measure the size of the prediction errors: the vertical distance between a point and the regression line.
- Each observation has a residual.
- Calculation for each residual: $y - \hat{y}$
- A large residual indicates an unusual observation.
- The smaller the absolute value of a residual, the closer the predicted value is to the actual value, so the better the prediction.

The Method of Least Squares Yields the Regression Line
Residual sum of squares: $\sum (y - \hat{y})^2$
The least squares regression line is the line that makes the residual sum of squares, the sum of the squared vertical distances between the points and their predictions, as small as possible.
Note: The sum of the residuals about the regression line is always zero.

Regression Formulas for y-Intercept and Slope
Slope: $b = r \left( \frac{s_y}{s_x} \right)$
y-Intercept: $a = \bar{y} - b\bar{x}$
Notice that the slope b is directly related to the correlation r, and the y-intercept depends on the slope.
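As a rough illustration of these formulas, the sketch below computes r, then the slope from r and the two standard deviations, then the intercept from the slope and the two means. It also checks that the residuals of the fitted line sum to zero. The data and the function name `least_squares_line` are hypothetical.

```python
from statistics import mean, stdev


def least_squares_line(xs, ys):
    """Return (a, b, r) using b = r * (s_y / s_x) and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation: sum of cross-deviations scaled by (n - 1) * s_x * s_y.
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x       # slope
    a = y_bar - b * x_bar   # y-intercept
    return a, b, r


# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b, r = least_squares_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # y minus y-hat

print(a, b)            # intercept and slope, approximately 2.2 and 0.6
print(sum(residuals))  # zero up to floating-point rounding
```

Because the intercept is built from the means, the fitted line always passes through the point of means, which is why the residuals cancel.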

Calculating the slope and y-intercept for the regression line
The baseball data in Example 9 illustrate the calculations. The resulting regression line predicts team scoring from batting average. [Fitted equation not preserved.]

The Slope and the Correlation
Correlation:
- Describes the strength of the linear association between 2 variables.
- Does not change when the units of measurement change.
- Does not depend upon which variable is the response and which is the explanatory.

The Slope and the Correlation
Slope:
- Numerical value depends on the units used to measure the variables.
- Does not tell us whether the association is strong or weak.
- The two variables must be identified as response and explanatory variables.
- The regression equation can be used to predict values of the response variable for given values of the explanatory variable.

The Squared Correlation
When a strong linear association exists, the regression equation predictions tend to be much better than the predictions using only $\bar{y}$. We measure the proportional reduction in error and call it $r^2$.
The typical way to interpret $r^2$ is as the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x.

The Squared Correlation
$r^2$ measures the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x.
A correlation of 0.9 means that $r^2 = 0.81$:
- 81% of the variation in the y-values can be explained by the explanatory variable, x.
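The proportional-reduction-in-error reading of the squared correlation can be verified directly. The sketch below compares the squared error from predicting every y with the mean of y against the squared error from the least squares line, and checks that the relative reduction equals the square of r. The data are hypothetical and the variable names are illustrative.

```python
from statistics import mean

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

b = sxy / sxx                  # least squares slope
a = y_bar - b * x_bar          # least squares intercept
r = sxy / (sxx * syy) ** 0.5   # correlation

error_using_mean = syy  # squared error when every prediction is y_bar
error_using_line = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
reduction = (error_using_mean - error_using_line) / error_using_mean

print(reduction, r ** 2)  # the two values agree up to rounding
```

For these made-up values the reduction is about 0.6, so roughly 60% of the variation in y is accounted for by the linear relationship with x.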

Section 3.4 Cautions in Analyzing Associations

Extrapolation Is Dangerous
Extrapolation: Using a regression line to predict y-values for x-values outside the observed range of the data.
- Riskier the farther we move from the range of the given x-values.
- There is no guarantee that the relationship given by the regression equation holds outside the range of sampled x-values.

Be Cautious of Influential Outliers
One reason to plot the data before you do a correlation or regression analysis is to check for unusual observations. Search for observations that are regression outliers, being well removed from the trend that the rest of the data follow.

Outliers and Influential Points
A regression outlier is an observation that lies far away from the trend that the rest of the data follow.
An observation is influential if both of the following hold:
- Its x value is relatively low or high compared to the remainder of the data.
- The observation is a regression outlier.
Influential observations tend to pull the regression line toward that data point and away from the rest of the data points.

Outliers and Influential Points
Figure 3.18 An Observation Is a Regression Outlier if It Is Far Removed from the Trend that the Rest of the Data Follow. The top two points are regression outliers. Not all regression outliers are influential in affecting the correlation or slope.
Question: Which regression outlier in this figure is influential?
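The difference between a regression outlier that is influential and one that is not can be seen by refitting the slope with one extra point. In the sketch below, all data points are made up for illustration (they are not the points in Figure 3.18), and `slope` is a hypothetical helper.

```python
from statistics import mean


def slope(points):
    """Least squares slope for a list of (x, y) pairs."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data for illustration only.
base = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]

slope_base = slope(base)
# Regression outlier with x in the middle of the data: far above the trend.
slope_mid = slope(base + [(3, 12)])
# Regression outlier with an extreme x value: far below the trend.
slope_far = slope(base + [(15, 2)])

print(slope_base, slope_mid, slope_far)
```

With these values the outlier at a central x leaves the slope essentially unchanged, while the outlier at an extreme x pulls the line toward itself strongly enough to make the slope negative.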

Correlation Does Not Imply Causation
In a regression analysis, suppose that as x goes up, y also tends to go up (or down). Can we conclude that there's a causal connection, with changes in x causing changes in y?
- A strong correlation between x and y means that there is a strong linear association between the two variables.
- A strong correlation between x and y does not mean that x causes y to change.

Correlation Does Not Imply Causation
Data are available for all fires in Chicago last year on x = number of firefighters at the fire and y = cost of damages due to the fire.
1. Would you expect the correlation to be negative, zero, or positive?
2. If the correlation is positive, does this mean that having more firefighters at a fire causes the damages to be worse? Yes or No?
3. Identify a third variable that could be considered a common cause of x and y:
- Distance from the fire station
- Intensity of the fire
- Size of the fire

Lurking Variables & Confounding
A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest.
- Ice cream sales and drowning: lurking variable = temperature
- Reading level and shoe size: lurking variable = age
- Childhood obesity rate and GDP: lurking variable = time
When two explanatory variables are both associated with a response variable but are also associated with each other, there is said to be confounding. Lurking variables are not measured in the study but have the potential for confounding.

Simpson's Paradox
Simpson's paradox occurs when the direction of an association between two variables changes after we include a third variable and analyze the data at separate levels of that third variable.

Simpson's Paradox Example: Smoking and Health
Is Smoking Actually Beneficial to Your Health?
Table 3.7 Smoking Status and 20-Year Survival in Women
- Probability of death for smokers = 139/582 = 24%
- Probability of death for nonsmokers = 230/732 = 31%
It cannot be true that smoking improves your chances of living. What is going on?
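The two percentages on this slide follow from the counts it quotes. A quick check of the arithmetic, using only those four numbers (the variable names are illustrative):

```python
# Counts quoted on the slide: deaths and group totals.
smoker_deaths, smoker_total = 139, 582
nonsmoker_deaths, nonsmoker_total = 230, 732

smoker_rate = smoker_deaths / smoker_total
nonsmoker_rate = nonsmoker_deaths / nonsmoker_total

# Rounded to the nearest whole percent, as on the slide.
print(round(100 * smoker_rate), round(100 * nonsmoker_rate))
```

Ignoring age, the death rate is lower for smokers than for nonsmokers, which is the surprising overall comparison the slide asks about.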

Simpson's Paradox Example: Smoking and Health
Break out the data by age.
Table 3.8 Smoking Status and 20-Year Survival, for Four Age Groups

Simpson's Paradox Example: Smoking and Health
Could age explain the association?
Table 3.9 Conditional Percentages of Deaths for Smokers and Nonsmokers, by Age
For instance, for smokers of age 18–34, Table 3.8 gives 5 deaths; dividing by the total number of smokers in that age group gives a proportion who died of 0.028, or 2.8%.

Simpson's Paradox Example: Smoking and Health
An association can look quite different after adjusting for the effect of a third variable by grouping the data according to the values of the third variable (age).
Figure 3.23 MINITAB Bar Graph Comparing Percentage of Deaths for Smokers and Nonsmokers, by Age. This side-by-side bar graph shows the conditional percentages from Table 3.9.
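The reversal can be reproduced with a small made-up data set. The counts below are hypothetical and are NOT the values in Tables 3.8 or 3.9; they are chosen so that smokers are mostly in the younger group, where death rates are low for everyone. The names `data` and `overall_rate` are illustrative.

```python
# Hypothetical (deaths, total) counts per age group; NOT the textbook data.
data = {
    "younger": {"smoker": (10, 200), "nonsmoker": (2, 100)},
    "older": {"smoker": (30, 50), "nonsmoker": (100, 200)},
}


def overall_rate(status):
    """Death rate for one smoking status, pooling all age groups together."""
    deaths = sum(group[status][0] for group in data.values())
    total = sum(group[status][1] for group in data.values())
    return deaths / total


# Death rate within each age group, by smoking status.
within = {
    age: {status: deaths / total for status, (deaths, total) in group.items()}
    for age, group in data.items()
}

for age, rates in within.items():
    print(age, rates)
print("overall", overall_rate("smoker"), overall_rate("nonsmoker"))
```

Within each age group the smokers have the higher death rate, yet the pooled rate is lower for smokers, because the nonsmokers in this made-up sample are concentrated in the older, higher-risk group.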

The Effect of Lurking Variables on Associations
Lurking variables can affect associations in many ways. For instance, a lurking variable may be a common cause of both the explanatory and response variable.
In practice, there's usually not a single variable that causally explains a response variable or the association between two variables. More commonly, there are multiple causes. When there are multiple causes, the association among them makes it difficult to study the effect of any single variable.

The Effect of Confounding on Associations
When two explanatory variables are both associated with a response variable but are also associated with each other, confounding occurs. It is difficult to determine whether either of them truly causes the response because a variable's effect could be at least partly due to its association with the other variable.