# Chapter 2: Looking at Data - Relationships

Chapter 2: Looking at Data - Relationships
http://www.forbes.com/sites/erikaandersen/2012/03/23/true-fact-the-lack-of-pirates-is-causing-global-warming/

General Procedure
1. Plot the data.
2. Look for the overall pattern.
3. Calculate a numeric summary.
4. Answer the question (which will be defined shortly).

2.1: Relationships - Goals
- Be able to define what is meant by an association between variables.
- Be able to categorize whether a variable is a response variable or an explanatory variable.
- Be able to identify the key characteristics of a data set.

Questions
- What objects do the data describe?
- What variables are present and how are they measured?
- Are all of the variables quantitative?
- Are the variables associated with each other?

Association (cont.)
Two variables are associated if knowing the values of one of the variables tells you something about the values of the other variable.
1. Do you want to explore the association?
2. Do you want to show causality?

Variable Types
- Response variable (Y): the outcome of the study.
- Explanatory variable (X): explains or causes changes in the response variable.

Key Characteristics of Data
- Cases: Identify what they are and how many there are.
- Label: Identify what the label variable is (if present).
- Categorical or quantitative: Classify each variable as categorical or quantitative.
- Values: Identify the possible values for each variable.
- Explanatory or response: Classify each variable as explanatory or response.

2.2: Scatterplots - Goals
- Be able to create a scatterplot (lab).
- Be able to interpret a scatterplot:
  - Pattern
  - Outliers
  - Form, direction, and strength of a relationship
- Be able to interpret scatterplots that have categorical variables.

Scatterplot - Procedure
1. Decide which variable is the explanatory variable and put it on the x-axis. The response variable goes on the y-axis.
2. Label and scale your axes.
3. Plot the (x, y) pairs.

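Example: Scatterplot
The following data are used to determine the relationship between age and the change in systolic blood pressure (BP, mm Hg) after 24 hours in response to a particular treatment.
a) Draw a scatterplot of these data.

| Obs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Age (years) | 70 | 51 | 65 | 70 | 48 | 70 | 45 | 48 | 35 | 48 | 30 |
| BP (mm Hg) | -28 | -10 | -8 | -15 | -8 | -10 | -12 | 3 | 1 | -5 | 8 |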
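The scatterplot procedure can be tried without plotting software. The sketch below is a rough text-only stand-in for the lab's plotting tool: it puts Age on the horizontal axis and the BP change on the vertical axis. The function name `ascii_scatter` and the grid size are arbitrary choices made for illustration, and the data values are as read from the table in this example.

```python
# Age (years) and 24-hour change in systolic BP (mm Hg) from the example.
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def ascii_scatter(xs, ys, width=41, height=13):
    """Return a text scatterplot: x runs left to right, y bottom to top.

    Hypothetical helper for illustration; assumes xs and ys each contain
    at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column/row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    lines = ["|" + "".join(cells) for cells in grid]
    lines.append("+" + "-" * width)
    lines.append(f" Age: {x_min} to {x_max} (x), BP change: {y_min} to {y_max} (y)")
    return "\n".join(lines)


print(ascii_scatter(age, bp))
```

The points drift from the upper left to the lower right, which is the kind of overall pattern the next slides ask you to describe.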

Example: Scatterplot (cont) [Figure: scatterplot with Age on the horizontal axis]

Pattern: form, direction, strength, outliers

Pattern: linear, nonlinear, no relationship

Outliers

Example: Scatterplot (cont) [Figure: scatterplot with Age on the horizontal axis]

Scatterplot with Categorical Variables
http://statland.org/Software_Help/Minitab/MTBpul2.htm

I am a Turkey, not Tukey! Thank you for not eating me!

2.3: Correlation - Goals
- Be able to use (and calculate) the correlation to describe the direction and strength of a linear relationship.
- Be able to recognize the properties of the correlation.
- Be able to determine when (and when not) you can use correlation to measure the association.

Sample correlation, r (Pearson’s Sample Correlation Coefficient)

Sum of Squares
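For reference, here is a sketch of the standard definitions that these two slide titles point to. The slides' own formulas are not part of the transcript, and the symbols $S_{xx}$, $S_{yy}$, and $S_{xy}$ are a common textbook convention assumed here.

$$
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

$$
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
  = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}},
\qquad s_x = \sqrt{\frac{S_{xx}}{n-1}}, \quad s_y = \sqrt{\frac{S_{yy}}{n-1}}
$$

The two forms of $r$ agree because the factors of $n-1$ cancel once $s_x$ and $s_y$ are written in terms of the sums of squares.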

Properties of Correlation
- r > 0 ==> positive association
- r < 0 ==> negative association
- r is always a number between -1 and 1.
- The strength of the linear relationship increases as |r| moves toward 1.
  - |r| = 1 occurs only if there is a perfect linear relationship.
  - r = 0 ==> x and y are uncorrelated.
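As a rough check of these properties, the Python sketch below computes r with the sums-of-squares form, r = S_xy / sqrt(S_xx * S_yy), for the age and blood-pressure data from the scatterplot example (values as read from the slide's table). The function name `correlation` and the two made-up perfectly linear series are illustrative assumptions, not part of the slides.

```python
from math import sqrt

# Age (years) and 24-hour change in systolic BP (mm Hg) from the example.
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def correlation(xs, ys):
    """Pearson's sample correlation, r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


r_bp = correlation(age, bp)
# Hypothetical perfectly linear responses, used only to show |r| = 1.
r_up = correlation(age, [2 * x + 1 for x in age])
r_down = correlation(age, [-3 * x for x in age])

print(round(r_bp, 4))    # -0.7695: negative association, fairly strong
print(round(r_up, 4))    # 1.0: perfect increasing line
print(round(r_down, 4))  # -1.0: perfect decreasing line
```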

Positive/Negative Correlation

Example: Positive/Negative Correlation
1) Would the correlation between the age of a used car and its price be positive or negative? Why?
2) Would the correlation between the weight of a vehicle and its miles per gallon be positive or negative? Why?

Variety of Correlation Values

Value of r

Cautions about Correlation
- Correlation requires that both variables be quantitative.
- Correlation measures the strength of LINEAR relationships only.
- The correlation is not resistant to outliers.
- Correlation is not a complete summary of bivariate data.

Datasets with r = 0.816

Questions about Correlation
- Does a small r indicate that x and y are NOT associated?
- Does a large r indicate that x and y are linearly associated?
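Both questions can be probed with small made-up data sets. The sketch below is only an illustration: the two data sets are invented, and `correlation` is a hypothetical helper implementing the usual Pearson formula. In the first set, y is completely determined by x yet r is 0; in the second, the relationship is a curve yet r is close to 1.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson's sample correlation, r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Invented data 1: a symmetric parabola. Strong association, but not linear.
xs_sym = [-3, -2, -1, 0, 1, 2, 3]
r_small = correlation(xs_sym, [x ** 2 for x in xs_sym])

# Invented data 2: one rising arm of the same parabola. Still a curve.
xs_arm = list(range(1, 11))
r_large = correlation(xs_arm, [x ** 2 for x in xs_arm])

print(f"symmetric parabola: r = {r_small:.3f}")  # r is 0 here
print(f"rising arm only:    r = {r_large:.3f}")  # roughly 0.97
```

So the answer to both questions is "not necessarily", which is why the slides insist on plotting the data before trusting r.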

2.4: Least-Squares Regression - Goals
- Be able to generally describe the method of least-squares regression.
- Be able to calculate and interpret the regression line.
- Using the least-squares regression line, be able to predict the value of y for any appropriate value of x.
- Be able to calculate $r^2$.
- Be able to explain the meaning of $r^2$.
  - Be able to discern what $r^2$ does NOT explain.

Regression Line
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We can use a regression line to predict the value of y for a given value of x.

Idea of Linear Regression

Linear Regression
$\hat{y} = b_0 + b_1 x$, where $b_0 = \bar{y} - b_1 \bar{x}$

Example: Regression Line
$\hat{y} = 20.11 - 0.526x$ [Figure: Age on the horizontal axis]

Example: Regression Line
The following data are used to determine the relationship between age and the change in systolic blood pressure (BP, mm Hg) after 24 hours in response to a particular treatment (the same 11 observations as in the scatterplot example).
$\bar{x} = 52.727$, $\bar{y} = -7.636$, $s_x = 14.164$, $s_y = 9.688$, $r = -0.76951$
b) What is the regression line for these data?
c) What would the predicted value be for someone who is 51 years old?
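Parts b) and c) can be worked from the summary statistics alone. The sketch below assumes the standard least-squares slope formula $b_1 = r \, s_y / s_x$, which is not in the transcript, together with the intercept formula $b_0 = \bar{y} - b_1 \bar{x}$ from the Linear Regression slide. The function name `predict` is an illustrative choice.

```python
# Summary statistics quoted in the example.
x_bar, y_bar = 52.727, -7.636   # mean age (years), mean BP change (mm Hg)
s_x, s_y = 14.164, 9.688        # sample standard deviations
r = -0.76951                    # sample correlation

# Assumed standard formulas: slope from r and the two standard deviations,
# intercept chosen so the line passes through (x_bar, y_bar).
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar


def predict(age):
    """Predicted BP change (mm Hg) at the given age, from the fitted line."""
    return b0 + b1 * age


print(f"y-hat = {b0:.3f} + ({b1:.3f}) x")
print(f"predicted BP change at age 51: {predict(51):.2f} mm Hg")
```

The coefficients agree with the slide's line, ŷ = 20.11 - 0.526x, up to rounding in the last digit, and the prediction at age 51 is a drop of a little under 7 mm Hg.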

Facts about Least-Squares Regression

$r^2$

Example: Regression Line
The following data are used to determine the relationship between age and the change in systolic blood pressure (BP, mm Hg) after 24 hours in response to a particular treatment (the same 11 observations as in the scatterplot example).
d) What percent of the variation in y is explained by the regression line?
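One way to approach part d) is to compute $r^2$ as the fraction of the total variation in y that the fitted line accounts for. The sketch below assumes the usual identity $r^2 = 1 - \mathrm{SSE}/\mathrm{SST}$ for simple least-squares regression; the names `sst` and `sse` are conventional labels rather than notation from the slides, and the data are as read from the example's table.

```python
# Age (years) and 24-hour change in systolic BP (mm Hg) from the example.
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(bp) / n

# Least-squares slope and intercept computed directly from the data.
s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, bp))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

sst = sum((y - y_bar) ** 2 for y in bp)                       # total variation in y
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, bp))  # variation left around the line
r_squared = 1 - sse / sst

print(f"r^2 = {r_squared:.3f}")  # 0.592, i.e. about 59% of the variation
```

The same number comes from squaring the quoted correlation, since (-0.76951)^2 is about 0.592.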

Beware of interpretation of $r^2$
- Linearity
- Outliers
- Good prediction

2.5: Cautions about Correlation and Regression - Goals
- Be able to calculate the residuals.
- Be able to use a residual plot to assess the fit of a regression line.
- Be able to identify outliers and influential observations by looking at scatterplots and residual plots.
- Be able to determine when you can predict a new value.
- Be able to identify lurking variables that can influence the relationship between two variables.
- Be able to explain the difference between association and causation.

Residuals

Example: Regression Line
The following data are used to determine the relationship between age and the change in systolic blood pressure (BP, mm Hg) after 24 hours in response to a particular treatment (the same 11 observations as in the scatterplot example).
e) What is the residual for someone who is 51 years old?
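A residual is the observed response minus the response predicted by the line, $y - \hat{y}$. The sketch below fits the least-squares line to the example data (values as read from the slide's table) and reports the residual for the one 51-year-old in the data set; variable names are illustrative.

```python
# Age (years) and 24-hour change in systolic BP (mm Hg) from the example.
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(bp) / n

# Least-squares slope and intercept computed directly from the data.
s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, bp))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual = observed y minus predicted y-hat, one per observation.
residuals = [y - (b0 + b1 * x) for x, y in zip(age, bp)]

obs = age.index(51)               # position of the 51-year-old (observation 2)
print(round(residuals[obs], 2))   # -3.27: observed drop was larger than predicted
print(abs(sum(residuals)) < 1e-9) # least-squares residuals sum to zero
```

Plotting these residuals against age gives the residual plot used on the following slides to judge linearity and constant variance.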

Residual Plots [Figure panels: Good; Linearity violation]

Residual Plots [Figure panels: Good; Constant variance violation]

Residual Plots: BP [Figure panels: Original; Y outlier]

Residual Plots: BP [Figure panels: Original; X outlier]

Influential Point
- An outlier is an observation that lies outside the overall pattern of the other observations.
- An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation.
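The definition of influence suggests a direct check: drop each observation in turn and see how much the result changes. The sketch below does this for the least-squares slope of the age and blood-pressure example (values as read from the slide's table). The leave-one-out loop and the helper name `slope` are illustrative choices, not a method taken from the slides, and how large a change counts as "marked" is a judgment call.

```python
# Age (years) and 24-hour change in systolic BP (mm Hg) from the example.
age = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def slope(xs, ys):
    """Least-squares slope b1 = S_xy / S_xx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


full = slope(age, bp)

# Refit the line with each observation removed in turn.
leave_one_out = [
    slope(age[:i] + age[i + 1:], bp[:i] + bp[i + 1:])
    for i in range(len(age))
]

print(f"all 11 observations: slope = {full:.3f}")
for i, b in enumerate(leave_one_out, start=1):
    print(f"without obs {i:2d}: slope = {b:.3f} (change {b - full:+.3f})")
```

For instance, removing observation 1 (age 70, BP change -28) moves the slope from about -0.526 to about -0.399.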

Cautions about Correlation and Regression: Extrapolation

Cautions about Correlation and Regression:
- Both describe linear relationships.
- Both are affected by outliers.
- Always PLOT the data.
- Beware of extrapolation.
- Beware of lurking variables.
  - Lurking variables are important in the study but are not included.
  - Confounding variables confuse the issue.
- Correlation (association) does NOT imply causation!

Lurking Variables
In each of these cases, identify the lurking variable.
1. For children, there is an extremely strong correlation between shoe size and math scores.
2. There is a very strong correlation between ice cream sales and the number of deaths by drowning.
3. There is a very strong correlation between the number of churches in a town and the number of bars in a town.

What is the lurking variable?
http://www.forbes.com/sites/erikaandersen/2012/03/23/true-fact-the-lack-of-pirates-is-causing-global-warming/

2.6: Data Analysis for Two-Way Tables - Goals
Statements
- The distribution of two random variables (a bivariate distribution) is called a joint distribution.
- Two random variables are similar to two events in that they can have conditional probabilities and can be independent of each other.

Goal
- Interpret examples of Simpson’s paradox.

Simpson’s Paradox
An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.

Simpson’s Paradox
Consider the acceptance rates for the following groups of men and women who applied to college.
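The slide's acceptance table is not part of the transcript, so the sketch below uses invented counts purely to show how the reversal can arise. Every number, the school names, and the helper `combined_rate` are hypothetical. In this made-up data, women have the higher acceptance rate within each school, yet men have the higher rate once the schools are combined, because most women applied to the school that admits few applicants.

```python
# Hypothetical (admitted, applicants) counts, invented for illustration only.
applications = {
    "School A": {"men": (480, 600), "women": (180, 200)},
    "School B": {"men": (20, 200), "women": (120, 600)},
}

# Within each school, compare acceptance rates for men and women.
for school, groups in applications.items():
    men_rate = groups["men"][0] / groups["men"][1]
    women_rate = groups["women"][0] / groups["women"][1]
    print(f"{school}: men {men_rate:.0%}, women {women_rate:.0%}")


def combined_rate(sex):
    """Acceptance rate for one sex after pooling both schools."""
    admitted = sum(groups[sex][0] for groups in applications.values())
    applicants = sum(groups[sex][1] for groups in applications.values())
    return admitted / applicants


print(f"Combined: men {combined_rate('men'):.1%}, women {combined_rate('women'):.1%}")
```

Here the school applied to acts as the lurking variable: it is related both to the applicant's sex and to the chance of acceptance.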

2.7: The Question of Causation - Goals
- Be able to explain an association:
  - Causation
  - Common response
  - Confounding variables
- Apply the criteria for establishing causation.

Causation
Association does not mean causation!

Establishing Causation
Perform an experiment! What do we need for causation?
1. The association is strong.
2. The association is consistent.
   - The connection happens in repeated trials.
   - The connection happens under varying conditions.
3. Higher doses are associated with stronger responses.
4. The alleged cause precedes the effect.
5. The alleged cause is plausible.