Presentation on theme: "Ch 2 and 9.1 Relationships Between 2 Variables"— Presentation transcript:
1 Ch 2 and 9.1 Relationships Between 2 Variables More than one variable can be measured on each individual. Examples: Gender and Height; Size and Cost; Eye color and Major. We want to look at the relationship among these variables. Is there an association between these two variables? Two variables measured on the same individuals are associated if some values of one variable tend to occur more often with some values of the second variable than with other values of that variable.
2 Relationships Between 2 Variables If we expect one variable to influence another, we call it the ___________ variable. It explains or influences changes in the response variable. The variable that is influenced is called the ____________ variable. It measures an outcome of a study. In each of the following examples, identify the explanatory and response variables: Gender and blood pressure; Class attendance and course grade; Number of beers and BAC.
3 Relationships Between 2 Variables We may be interested in relationships between different types of variables: Categorical and Numeric; Categorical and Categorical; Numeric and Numeric.
4 Relationships between Categorical and Numeric Variables We are interested in comparing the numerical variable across each of the levels of the categorical variable. Examples: Compare high speeds for 4 different car brands; Compare sucrose levels for 5 different types of fruit; Compare GPR for 20 different majors.
5 Relationships between Categorical and Numeric Variables Graphical Comparison. Example: Sucrose levels of fruits (fictitious data).
6 Relationships between Categorical and Numeric Variables Numerical Comparison: We could also look at summary statistics for each group.
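The numerical comparison on this slide can be sketched in a few lines of Python. The fruit names and sucrose values below are invented for illustration (the slide's own data are not in the transcript), and `summarize` is a hypothetical helper name.

```python
from statistics import mean, median, stdev

# Made-up sucrose measurements, for illustration only.
sucrose = {
    "apple": [2.1, 2.5, 1.9, 2.3],
    "banana": [2.4, 2.9, 2.6, 2.5],
    "peach": [4.8, 5.1, 4.6, 4.9],
}

def summarize(groups):
    """Return {level: (n, mean, median, sample standard deviation)}."""
    return {
        level: (len(values), mean(values), median(values), stdev(values))
        for level, values in groups.items()
    }

summary = summarize(sucrose)
for level, (n, avg, med, sd) in summary.items():
    print(f"{level:8s} n={n} mean={avg:.2f} median={med:.2f} sd={sd:.2f}")
```

Comparing the rows of the printed summary plays the same role as comparing side-by-side boxplots: one numeric variable is summarized separately for each level of the categorical variable.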
7 Ch 9.1 Relationships Between Two Categorical Variables Depending on the situation, one of the variables is the explanatory variable and the other is the response variable. In this case, we look at the percentages of one variable for each level of the other variable. Examples: Gender and Soda Preference; Country of Origin and Marital Status; Smoking Habits and Socioeconomic Status.
8 Two-Way Tables Two-way tables come about when we are interested in the relationship between two categorical variables. One of the variables is the _____________. The other is the _______________. The combination of a row variable and a column variable is a ______________.
10 Relationships between two categorical variables Example: Gender and Highest Degree Obtained. Joint Distribution: How likely are you to have a bachelor’s degree and be a male? _____________ Marginal Distribution: What is the least likely highest degree obtained? _____________ Conditional Distribution: If you are a female, how likely are you to have obtained a graduate degree? ______________
11 Relationships between two categorical variables Shows the percentages for the joint, marginal, and conditional distributions.
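As a rough sketch of how the three distributions differ, the Python below computes each one from a small two-way table. The counts are hypothetical (the table from the slide is not in the transcript), so the resulting percentages are not the answers to the questions above.

```python
# Hypothetical counts: rows are gender, columns are highest degree obtained.
counts = {
    "female": {"bachelor": 30, "graduate": 20, "none": 50},
    "male": {"bachelor": 25, "graduate": 15, "none": 60},
}

# Grand total of all cells in the table.
total = sum(sum(row.values()) for row in counts.values())

# Joint distribution: each cell divided by the grand total.
joint = {g: {d: c / total for d, c in row.items()} for g, row in counts.items()}

# Marginal distribution of degree: column totals divided by the grand total.
degrees = list(next(iter(counts.values())))
marginal_degree = {d: sum(counts[g][d] for g in counts) / total for d in degrees}

# Conditional distribution of degree given gender: each cell divided by its row total.
conditional = {
    g: {d: c / sum(row.values()) for d, c in row.items()}
    for g, row in counts.items()
}

print("P(male and bachelor)  =", joint["male"]["bachelor"])
print("P(graduate)           =", marginal_degree["graduate"])
print("P(graduate | female)  =", conditional["female"]["graduate"])
```

The only thing that changes between the three calculations is the denominator: the grand total for joint and marginal percentages, and the row total for percentages conditional on gender.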
12 Ch 2 Relationships Between 2 Numeric Variables Depending on the situation, one of the variables is the explanatory variable and the other is the response variable. There is not always an explanatory-response relationship. Examples: Height and Weight; Income and Age; SAT scores on math exam and on verbal exam; Amount of time spent studying for an exam and exam score.
13 Relationships between 2 numeric variables Scatterplots: Look for the overall pattern and any striking deviations from that pattern. Look for outliers, values falling outside the overall pattern of the relationship. You can describe the overall pattern of a scatterplot by the form, direction, and strength of the relationship. Form: Linear or clusters. Direction: Two variables are _____________________ when above-average values of one tend to accompany above-average values of the other, and likewise below-average values also tend to occur together. Two variables are _____________________ when above-average values of one variable accompany below-average values of the other variable, and vice versa. Strength: how close the points lie to a line.
15 Relationships between 2 numeric variables Example: Vehicle Weight and Horsepower. __________ Association
16 Relationships between 2 numeric variables ___________ or r: measures the direction and strength of the linear relationship between two numeric variables. General Properties: It must be between -1 and 1 (-1 ≤ r ≤ 1). If r is negative, the relationship is negative. If r = -1, there is a perfect negative linear relationship (extreme case). If r is positive, the relationship is positive. If r = 1, there is a perfect positive linear relationship (extreme case). If r is 0, there is no linear relationship. r measures the strength of the linear relationship. If explanatory and response are switched, r remains the same. r has no units of measurement associated with it. Scale changes do not affect r.
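Several of these properties can be checked numerically. The sketch below uses the usual sample formula for r and invented weight and horsepower values; `correlation` is a hypothetical helper written for this example, not something from the slides.

```python
from math import sqrt

def correlation(x, y):
    """Sample correlation r between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up data for illustration only.
weight_lb = [2200, 2600, 3000, 3400, 3800]
horsepower = [95, 110, 140, 150, 190]

r = correlation(weight_lb, horsepower)

# Switching explanatory and response leaves r unchanged.
r_swapped = correlation(horsepower, weight_lb)

# Changing units (pounds to kilograms) leaves r unchanged.
weight_kg = [w * 0.4536 for w in weight_lb]
r_rescaled = correlation(weight_kg, horsepower)

print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4))
```

All three printed values agree, which illustrates that r is symmetric in the two variables and unaffected by a change of measurement scale.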
18 Relationships between 2 numeric variables Examples of extreme cases: r = 1, r = 0, r = -1
19 Relationships between 2 numeric variables Match each correlation to its scatterplot: r = 0.04, r = 0.43, r = -0.84, r = 0.76, r = 0.21
20 Relationships between 2 numeric variables It is possible for there to be a strong relationship between two variables and still have r ≈ 0. Ex.
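One way to see this is a perfect curved relationship. In the sketch below, y is determined exactly by x (y = x²), yet the correlation is 0 because the relationship is not linear. The data and the helper name `correlation` are assumptions made for this illustration.

```python
from math import sqrt

def correlation(x, y):
    """Sample correlation r between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# x is symmetric around 0 and y depends on x exactly, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r = correlation(x, y)
print(r)
```

The points fall exactly on a parabola, so the association is as strong as it can be, but r only measures how closely the points follow a straight line.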
21 Relationships between 2 numeric variables Important notes: Association does not imply causation. Correlation does not imply causation. Slope is not correlation. A scale change does not change the correlation. Correlation doesn’t measure the strength of a non-linear relationship.
22 Regression Line A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. A regression line summarizes the relationship between two variables, but only in a specific setting: when one of the variables helps explain or predict the other. We often use a regression line to predict the value of y for a given value of x. Regression, unlike correlation, requires that we have an explanatory variable and a response variable.
23 Regression Line Fitting a line to data means drawing a line that comes as close as possible to the points. Extrapolation: the use of a regression line for prediction far outside the range of values of the explanatory variable x that you used to obtain the line. Such predictions are often not accurate.
24 Least-Squares Regression Line The least-squares regression line of y on x is the line that makes the sum of squares of the vertical distances of the data points from the line as small as possible. These vertical distances are called the residuals, or the errors in prediction, because they measure how far each point is from the line: residual $= y - \hat{y}$, where $y$ is the observed value and $\hat{y}$ is the predicted value.
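25 Least-Squares Regression Line The equation of the least-squares regression line of y on x is $\hat{y} = b_0 + b_1 x$, with slope $b_1 = r \, s_y / s_x$ and intercept $b_0 = \bar{y} - b_1 \bar{x}$.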
26 Least-Squares Regression Line The expression for the slope, $b_1$, says that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y. The slope, $b_1$, is the amount by which the predicted y changes when x increases by one unit. The intercept, $b_0$, is the predicted value of y when $x = 0$. The least-squares regression line ALWAYS passes through the point $(\bar{x}, \bar{y})$.
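A minimal sketch of these facts, assuming the standard formulas $b_1 = r \, s_y / s_x$ and $b_0 = \bar{y} - b_1 \bar{x}$ and using a small invented data set. The function name `least_squares` is hypothetical.

```python
from statistics import mean, stdev

def least_squares(x, y):
    """Return (b0, b1, r) for the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    # Sample correlation: average product of standardized deviations.
    r = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (
        (len(x) - 1) * s_x * s_y
    )
    b1 = r * s_y / s_x          # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1, r

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1, r = least_squares(x, y)
predicted = [b0 + b1 * a for a in x]
residuals = [obs - pred for obs, pred in zip(y, predicted)]

print(f"y-hat = {b0:.2f} + {b1:.2f} x")           # slope 0.60, intercept 2.20
print("sum of residuals:", round(sum(residuals), 10))
print("line at x-bar:", b0 + b1 * mean(x), "y-bar:", mean(y))
```

For these values the residuals sum to zero (up to rounding) and the fitted line evaluated at the mean of x returns the mean of y, matching the statement that the line always passes through $(\bar{x}, \bar{y})$.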
27 $r^2$ in Regression The square of the correlation, $r^2$, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x. Use $r^2$ as a measure of how successfully the regression explains the response. Interpret $r^2$ as the “percent of variation explained”. For Simple Linear Regression, $r^2$ is simply the square of the correlation coefficient.
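The "fraction of variation explained" reading can be checked directly: compute the total variation in y, the variation left over in the residuals, and compare with the squared correlation. This is only a sketch on invented data; the variable names are assumptions.

```python
from math import sqrt

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
sxx = sum((a - x_bar) ** 2 for a in x)
syy = sum((b - y_bar) ** 2 for b in y)   # total variation in y

# Least-squares slope and intercept.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Variation in y left unexplained by the line (sum of squared residuals).
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))

explained_fraction = 1 - sse / syy
r = sxy / sqrt(sxx * syy)

print(round(explained_fraction, 4), round(r ** 2, 4))
```

Both printed numbers are 0.6 for this data set, so the line explains 60% of the variation in y.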
28 Relationships between 2 numeric variables Example: How much of the variation is explained by the least-squares line of y on x? ______ What is the correlation coefficient? ______ Horsepower = *weight (Equation of the line.) __________: y-value or response (horsepower) where the line crosses the y-axis. _______: increase in response for a unit increase in the explanatory variable. So if weight increases by one pound, horsepower increases by 0.04 units (on average).
29 Relationships between 2 variables Lurking Variable: A variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables. Simpson’s Paradox: An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s Paradox. This can happen when a lurking variable is present. Please see Examples 9.9 and 9.10 in the text.
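A small numerical illustration of the reversal, using hypothetical counts (these are not the data from Examples 9.9 and 9.10). Treatment A has the higher success rate within each group, yet the lower rate once the groups are combined, because group membership acts as a lurking variable.

```python
# Hypothetical (successes, trials) counts for two treatments in two groups.
data = {
    "group 1": {"A": (8, 10), "B": (70, 100)},
    "group 2": {"A": (20, 100), "B": (1, 10)},
}

def rate(successes, trials):
    """Success proportion."""
    return successes / trials

# Success rates within each group.
within = {
    group: {treatment: rate(*cell) for treatment, cell in arms.items()}
    for group, arms in data.items()
}

# Success rates after pooling both groups together.
combined = {}
for treatment in ("A", "B"):
    successes = sum(data[group][treatment][0] for group in data)
    trials = sum(data[group][treatment][1] for group in data)
    combined[treatment] = rate(successes, trials)

for group, rates in within.items():
    print(group, rates)
print("combined", combined)
```

The reversal comes from the unequal group sizes: most of A's trials fall in the group where success is rare for both treatments, while most of B's trials fall in the group where success is common.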
30 Outliers and Influential Observations in Regression An outlier is an observation that lies outside the overall pattern of the other observations. An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
31 Outliers and Influential Observations in Regression Child 18 is an outlier in the x direction. Because of its extreme position on the age scale, this point has a strong influence on the position of the regression line. $r^2$ is also affected by the influential observation. With Child 18, $r^2$ = 41%, but without Child 18, $r^2$ = 11%. The apparent strength of the association was largely due to a single influential observation. The dashed line was calculated leaving out Child 18. The solid line is with Child 18.
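The same effect can be reproduced on a toy data set: fit the least-squares line with and without one point that is extreme in the x direction and compare. The numbers below are invented and are not the children's data from the slide; `fit` is a hypothetical helper.

```python
def fit(x, y):
    """Return (slope, intercept, r_squared) of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar, sxy ** 2 / (sxx * syy)

# Made-up data; the last point is an outlier in the x direction.
x = [1, 2, 3, 4, 5, 20]
y = [3, 1, 4, 2, 3, 12]

slope_with, intercept_with, r2_with = fit(x, y)
slope_without, intercept_without, r2_without = fit(x[:-1], y[:-1])

print(f"with outlier:    slope={slope_with:.3f}, r^2={r2_with:.3f}")
print(f"without outlier: slope={slope_without:.3f}, r^2={r2_without:.3f}")
```

Removing the single extreme point changes both the slope and $r^2$ substantially, which is what makes that observation influential.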