
1 Chapter 4 The Relation between Two Variables Prof. Felix Apfaltrer Office:N518 Phone: 212-220 8000X 7421 Office.


1 Chapter 4: The Relation between Two Variables. Prof. Felix Apfaltrer. Office: N518. Phone: 212-220-8000 x7421. Office hours: Tue/Thu 1:30-3pm

2 A mathematical model is a mathematical expression that represents some phenomenon. It can be a deterministic model or a probabilistic model. Models often describe the relationship between 2 variables.

3 Learning objectives: (1) Draw and interpret scatter diagrams. (2) Understand the properties of the linear correlation coefficient. (3) Compute and interpret the linear correlation coefficient.

4 4.1 Scatter Diagrams and Correlation. When we have two variables, they could be related in one of several different ways: they could be unrelated; one variable (the explanatory or predictor variable) could be used to explain the other (the response or dependent variable); or one variable could be thought of as causing the other variable to change. When dealing with 2 variables, we try to see the relationship between them. Some examples are: rainfall amounts and plant growth (possible lurking variable: sunlight); exercise and cholesterol levels for a group of people (possible lurking variable: diet); height and weight for a group of people; height and the fastest speed at which you have ever driven a car. Sometimes there is a 3rd variable that is not considered but that affects the results (a lurking variable). For example, shoe size does not cause height to change (age affects both variables). Therefore, we can't conclude that variable A causes B.

5 Scatter Diagrams. The scatter diagram is a graph that visually shows the relationship between 2 quantitative variables. The explanatory variable is plotted on the horizontal axis and the response variable on the vertical axis. The response variable (y-axis) is the variable whose value can be explained by the value of the explanatory variable (x-axis).

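6 Linear Correlation. The linear correlation coefficient is a measure of the strength and direction of the linear relation between two quantitative variables. The sample correlation coefficient r is $r = \frac{\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$, where $\bar{x}$ and $\bar{y}$ are the sample means and $s_x$ and $s_y$ are the sample standard deviations. This should be computed with software (and not by hand) whenever possible.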
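A minimal sketch of computing r as the sum of products of z-scores divided by n - 1. This is one standard form of the definition and is an assumption here, since the slide's own formula did not survive in the transcript. The data values and the function name `correlation` are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample linear correlation coefficient r via z-scores."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divisor n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical data, not taken from the slides.
xs = [1, 1, 3, 5]
ys = [2, 8, 6, 4]
r = correlation(xs, ys)
print(round(r, 3))  # -0.135
```

The result is negative and close to zero, which would be read as a very weak negative linear relation.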

7 Coefficient of Correlation. It answers the question "How strong is the linear relationship between 2 variables?" The population correlation coefficient is denoted ρ (rho). Values range from -1 to +1. It measures the degree of association and is used mainly for understanding. The sign of r indicates the direction of the relationship. Positive: the two variables tend to increase together. Negative: as one variable increases, the other is likely to decrease.

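8 [Figure: correlation scale running from -1.0 (perfect negative correlation) through 0 (no correlation) to +1.0 (perfect positive correlation), with the degree of negative correlation increasing toward -1.0 and the degree of positive correlation increasing toward +1.0.]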

9 Examples of positive correlation: strong positive, r = .8; moderate positive, r = .5; very weak, r = .1. Examples of negative correlation: strong negative, r = –.8; moderate negative, r = –.5; very weak, r = –.1. In general, if the correlation is visible to the eye, then it is likely to be strong.

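10 Shortcut formula: $r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\,\sqrt{n(\sum y^2) - (\sum y)^2}}$. For the slide's data (x, y), with $n = 4$: $r = \frac{4(48) - (10)(20)}{\sqrt{4(36) - (10)^2}\,\sqrt{4(120) - (20)^2}} = \frac{-8}{\sqrt{44}\,\sqrt{80}} \approx -0.135$.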
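The shortcut computation can be checked with a short sketch that works directly from the sums appearing on this slide (n = 4, Σx = 10, Σy = 20, Σx² = 36, Σy² = 120, Σxy = 48). The function name `r_from_sums` and its parameter names are hypothetical.

```python
from math import sqrt


def r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Correlation coefficient from the shortcut (sums-only) formula."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (sqrt(n * sum_x2 - sum_x ** 2)
                   * sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = r_from_sums(n=4, sum_x=10, sum_y=20, sum_x2=36, sum_y2=120, sum_xy=48)
print(round(r, 3))  # -0.135
```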

11 Correlation is not causation! Just because two variables are correlated does not mean that one causes the other to change. There is a strong correlation between shoe sizes and vocabulary sizes for grade school children. Clearly, larger shoe sizes do not cause larger vocabularies, and larger vocabularies do not cause larger shoe sizes. Often lurking variables result in confounding.
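A small illustration of how a lurking variable can produce a strong correlation. All numbers below are made up: shoe size and vocabulary are each generated from age (plus small fixed offsets), and neither is computed from the other, yet their correlation comes out close to 1. The helper `correlation` and the generating formulas are assumptions for this sketch only.

```python
from math import sqrt


def correlation(xs, ys):
    """Sample linear correlation coefficient r."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical grade-school data: age is the lurking variable.
ages = [6, 7, 8, 9, 10, 11]
shoe_noise = [0.1, -0.2, 0.0, 0.2, -0.1, 0.0]
vocab_noise = [50, -30, 80, -60, 10, -20]

shoe_sizes = [0.8 * a - 4 + e for a, e in zip(ages, shoe_noise)]
vocab_sizes = [900 * a - 2800 + e for a, e in zip(ages, vocab_noise)]

r_shoe_vocab = correlation(shoe_sizes, vocab_sizes)
print(round(r_shoe_vocab, 2))
```

Because both variables depend on age, the high r says nothing about shoe size influencing vocabulary.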

12 Summary: Chapter 4 – Section 1. Correlation between two variables can be described with both visual and numeric methods. Visual methods: scatter diagrams (analogous to histograms for single variables). Numeric methods: the linear correlation coefficient (analogous to mean and variance for single variables). Care should be taken in the interpretation of linear correlation (nonlinearity and causation).

13 Chapter 4 – Section 2. Learning objectives: Find the least-squares regression line and use the line to make predictions and estimations. Interpret the slope and the y-intercept of the least-squares regression line. Compute the sum of squared residuals.

14

15 If we have two variables X and Y, we often would like to model the relation as a line. Draw a line through the scatter diagram. We want to find the line that "best" describes the linear relationship: the regression line.

16 Linear Equations. We want to use a linear model. Linear models can be written in several different (equivalent) ways: $y = mx + b$; $y - y_1 = m(x - x_1)$; $y = b_1 x + b_0$. Because the slope and the intercept are important to analyze, we will use $y = b_1 x + b_0$.

17 BMCC PROFESSOR Linear Equations

18 What the residual is on the scatter diagram. [Figure labels: the model line; the x value of interest; the observed value $y$; the predicted value $\hat{y}$; the residual.] One difference between math and stat is that statistics assumes that the measurements are not exact, that there is an error or residual. The formula for the residual is always Residual = Observed – Predicted. The equation for the least-squares regression line is given by $\hat{y} = b_1 x + b_0$, where $b_1$ is the slope of the least-squares regression line (marginal change) and $b_0$ is the y-intercept of the least-squares regression line.

19 Least-Squares Property. A straight line satisfies this property if the sum of the squares of the residuals is the smallest sum possible. [Figure: scatter diagram of y against x with the fitted line $\hat{y}$.]
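The least-squares property can be checked numerically with a minimal sketch: fit the line, compute its sum of squared residuals, then nudge the slope and intercept and confirm that every nudged line does worse. The data set, the helper names, and the particular nudge sizes are all hypothetical.

```python
def fit_line(xs, ys):
    """Return (b1, b0) of the least-squares line y-hat = b1*x + b0."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * (sum_x / n)
    return b1, b0


def sum_squared_residuals(xs, ys, b1, b0):
    """Sum of (observed - predicted)^2 for the line y-hat = b1*x + b0."""
    return sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, not taken from the slides.
xs = [1, 1, 3, 5]
ys = [2, 8, 6, 4]

b1, b0 = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, b1, b0)

# Try nearby lines: each has a larger sum of squared residuals.
others = [
    sum_squared_residuals(xs, ys, b1 + d_slope, b0 + d_intercept)
    for d_slope in (-0.5, 0.0, 0.5)
    for d_intercept in (-1.0, 0.0, 1.0)
    if (d_slope, d_intercept) != (0.0, 0.0)
]
print(round(best, 3), round(min(others), 3))
```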

20 Shortcut formulas (calculators or computers can compute these values). Slope of the least-squares regression line: $b_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$. y-intercept: $b_0 = \bar{y} - b_1 \bar{x}$.

21 Finding the values of $b_1$ and $b_0$ by hand is a very tedious process. You should use software for this. Finding the coefficients $b_1$ and $b_0$ is only the first step of a regression analysis. We need to interpret the slope $b_1$. We need to interpret the y-intercept $b_0$. We need to do quite a bit more statistical analysis; this is covered in Section 4.3 and also in Chapter 14.

22 Data (x, y): $n = 4$, $\sum x = 10$, $\sum y = 20$, $\sum x^2 = 36$, $\sum y^2 = 120$, $\sum xy = 48$. Slope: $b_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} = \frac{4(48) - (10)(20)}{4(36) - (10)^2} = \frac{-8}{44} \approx -0.182$. Intercept: $b_0 = \bar{y} - b_1 \bar{x} = 5 - (-0.182)(2.5) \approx 5.45$. The estimated equation of the regression line is $\hat{y} = 5.45 - 0.182x$.
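The arithmetic on this slide can be reproduced with a short sketch that uses only the sums given (n = 4, Σx = 10, Σy = 20, Σx² = 36, Σxy = 48). The function name `line_from_sums` is hypothetical.

```python
def line_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Slope b1 and intercept b0 of the least-squares line from sums."""
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    x_bar, y_bar = sum_x / n, sum_y / n
    b0 = y_bar - b1 * x_bar
    return b1, b0


b1, b0 = line_from_sums(n=4, sum_x=10, sum_y=20, sum_x2=36, sum_xy=48)
print(round(b1, 3), round(b0, 2))  # -0.182 5.45
```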

23 Guidelines for Using the Regression Equation. 1. If there is no significant linear correlation, don't use the regression equation to make predictions. 2. When using the regression equation for predictions, stay within the scope of the available sample data. 3. A regression equation based on old data is not necessarily valid now. 4. Don't make predictions about a population that is different from the population from which the sample data was drawn.
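Guideline 2 can be enforced in code. The sketch below refuses to predict outside the range of x values that were sampled. The coefficients come from the sums in the worked example (b1 = -8/44, with means 2.5 and 5); the sample range of 1 to 5 and the function name `predict_within_scope` are assumptions made for illustration.

```python
def predict_within_scope(x, b1, b0, x_min, x_max):
    """Predict y-hat = b1*x + b0, refusing to extrapolate beyond the sample."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} is outside the sample range [{x_min}, {x_max}]"
        )
    return b1 * x + b0


b1 = -8 / 44           # slope from the worked example's sums
b0 = 5 - b1 * 2.5      # intercept: y-bar - b1 * x-bar
X_MIN, X_MAX = 1, 5    # assumed range of the sampled x values

print(round(predict_within_scope(3, b1, b0, X_MIN, X_MAX), 2))  # 4.91
```

Calling the function with, say, x = 10 raises a `ValueError` instead of returning an extrapolated value.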

24 Chapter 4 – Section 3. Learning objectives: Compute and interpret the coefficient of determination. Perform residual analysis on a regression model. Identify influential observations. The relationship is Total Deviation = Explained Deviation + Unexplained Deviation. The larger the explained deviation, the better the model is at prediction / explanation. The larger the unexplained deviation, the worse the model is at prediction / explanation.

25 We began with $y - \bar{y}$, the total deviation. Our regression model reduces this to $y - \hat{y}$, the unexplained deviation. The amount of reduction, $\hat{y} - \bar{y}$, is the explained deviation.

26 Instead of straight deviations, we use variations: Variation = Deviation². It is also true that Total Variation = Explained Variation + Unexplained Variation. A measure of the explanatory power of the model is the proportion of variation that is explained: $r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}}$.

27 Total sum of squares: $\sum (Y - \bar{Y})^2$. Explained sum of squares: $\sum (\hat{Y} - \bar{Y})^2$. Unexplained sum of squares: $\sum (Y - \hat{Y})^2$.

28 $r^2$ is called the coefficient of determination. It is the proportion of variation "explained" by the relationship between X and Y. To compute it, simply square the correlation r. $0 \le r^2 \le 1$; expressed as a percentage, it is the percentage of variation explained by X.
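A minimal sketch that computes the total, explained, and unexplained variation for a fitted least-squares line and takes their ratio as the coefficient of determination. The data and the function name `variation_breakdown` are hypothetical.

```python
def variation_breakdown(xs, ys):
    """Return (total, explained, unexplained) variation for a least-squares fit."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    predicted = [b1 * x + b0 for x in xs]

    total = sum((y - y_bar) ** 2 for y in ys)
    explained = sum((p - y_bar) ** 2 for p in predicted)
    unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return total, explained, unexplained


# Hypothetical data, not taken from the slides.
xs = [1, 1, 3, 5]
ys = [2, 8, 6, 4]

total, explained, unexplained = variation_breakdown(xs, ys)
r_squared = explained / total
print(round(total, 3), round(explained, 3), round(unexplained, 3))
print(round(r_squared, 4))
```

For this data the explained share is under 2 percent, so the line accounts for almost none of the variation in y.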

29 How can we tell how good our model is? To check whether a linear model is appropriate, plot the residuals (error) on the vertical axis against the explanatory variable (x) on the horizontal axis. If the plot shows a pattern (such as a curve), then the response (y) and explanatory (x) variables may have a nonlinear relationship. If there is no obvious pattern, the linear model could be acceptable.
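To see what a curved residual pattern looks like in numbers, the sketch below fits a line to data that is deliberately nonlinear (y = x², an assumption chosen for illustration) and lists the residuals in order of x. The helper name `residuals_of_linear_fit` is hypothetical.

```python
def residuals_of_linear_fit(xs, ys):
    """Fit y-hat = b1*x + b0 by least squares; return observed - predicted."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return [y - (b1 * x + b0) for x, y in zip(xs, ys)]


# Deliberately nonlinear, made-up data: y = x squared.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

residuals = residuals_of_linear_fit(xs, ys)
print(residuals)
```

The residuals are positive at both ends and negative in the middle, which is the U-shaped pattern that suggests a nonlinear relationship.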

30 Two example residual plots. [Figure: one residual plot with no spread and one with spread, marked by a dotted blue line.] The least-squares regression model assumes that the variance of the residuals is constant across values of the explanatory variable. To check whether the variance of the residuals is constant, plot the residuals (error) on the vertical axis against the explanatory variable (x) on the horizontal axis. This is the same plot as the plot checking linearity. If there is a spread (the dotted blue line), then a linear relationship is not very reliable.

31 Definition. Outliers for a least-squares regression are those observations that are unusually far away from the model line. There are several ways to identify outliers. The scatter diagram may show the outlier as a point away from the main pattern of points. The residual plot may show the outlier as an unusually high or unusually low residual. The boxplot of residuals may identify the outlier as a value outside the upper or lower fence.
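The boxplot check can be sketched as follows, assuming the common fence rule of Q1 - 1.5·IQR and Q3 + 1.5·IQR (the slides do not state the rule, and quartile conventions vary between textbooks and software). The residual values and the function name `residual_outliers` are made up for illustration.

```python
from statistics import quantiles


def residual_outliers(residuals):
    """Return residuals lying outside the lower or upper boxplot fence."""
    q1, _median, q3 = quantiles(residuals, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [r for r in residuals if r < lower_fence or r > upper_fence]


# Hypothetical residuals from some fitted regression line.
residuals = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 0.9, 1.1, 6.5]
outliers = residual_outliers(residuals)
print(outliers)  # [6.5]
```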

32 Three ways to identify outliers –From a scatter diagram –From a residual plot –From a boxplot

33 Influential Points. An influential point strongly affects the graph of the regression line: it has a significant effect on the value of the slope or a significant effect on the value of the intercept. Usually, influential observations are those with unusually high or unusually low values of the predictor (x) variable. [Figure: three example panels labeled outlier, influential, and definitely influential. Panel captions: (a) The x value is large compared to the others, and the point is not along the general linear pattern of the data. (b) The x and y values are large compared to the others; however, the point is along the general linear pattern of the data. (c) The point is not along the general linear pattern of the data; however, it is likely not to be influential.]
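One simple way to gauge influence is to fit the line with and without the suspect observation and compare the coefficients. The sketch below does this for made-up data in which one point has a large x value and lies off the linear pattern. The data, the helper `fit_line`, and the choice of suspect point are all assumptions.

```python
def fit_line(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b1, b0


# Hypothetical data lying close to a line with slope of about 2.
main_points = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 9.0), (5, 10.8)]
suspect = (15, 5.0)  # large x value, far from the linear pattern

slope_without, intercept_without = fit_line(main_points)
slope_with, intercept_with = fit_line(main_points + [suspect])

print(round(slope_without, 3), round(slope_with, 3))
```

Adding the single suspect point pulls the slope from about 1.95 to roughly zero, which is the kind of change that marks an observation as influential.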

34 If a particular observation is influential, we should investigate that observation. If the observation is a valid observation, we have a variety of options. We could collect additional points near the influential observation. We could collect additional points between the main part of our data and the influential point (to check whether the data is nonlinear, for example). We could use techniques that are resistant to influential observations.

35 Summary: Chapter 4 – Section 3. Diagnostics are very important in assessing the quality of a least-squares regression model. The coefficient of determination measures the percent of total variation explained by the model. The plot of residuals can detect nonlinear patterns, error variances that are not constant, and outliers. We must be careful when there are influential observations because they have an unusually large effect on the computation of our model parameters.

