Chapter 4. Correlation and Regression


1 Chapter 4

2 Correlation and Regression Correlation is a technique that measures the strength of the relationship between two continuous variables. For example, we could measure how strong the relationship is between people’s heights and their weights. A chart that defines a person’s “ideal” weight for a given height is constructed through the statistical technique called regression. Regression is a statistical technique that produces a model of the relationship (correlation) between two variables.

3 Linear Correlation Why would we want to know if there is a relationship between two variables? Possible correlation questions: Is there a relationship between a person’s income and intelligence? Is there a relationship between a country’s food supply and mortality rate? Is there a relationship between the average length of schooling for citizens in a country and the country’s life expectancy? Linear correlation will help us determine if there is a relationship, and how strong, or weak, that relationship is.

4 The Scatter Diagram Consider a sample of 12 randomly selected females attending Nassau Community College. We measure each female’s height and weight. Height and weight are the two continuous variables. We’ll label the height variable “x” and the weight variable “y.” For each female, we have a pair of numbers, height and weight, x and y. The pair of numbers can also be written as (x,y) which is called an ordered pair. The ordered pair (63, 123) would indicate that this student has height 63 inches and weight 123 pounds.

5

6 The Scatter Diagram A scatter diagram is a graph representing the ordered pairs of data on a set of axes. We start with two lines, a horizontal and a vertical line, to represent the two axes. The x-axis (horizontal) represents the x-values; these are the heights. The x-axis is labeled “height.” The y-axis (vertical) represents the y-values; these are the weights. The y-axis is labeled “weight.”

7

8 The Scatter Diagram We place a “dot” (a point) on the diagram to represent each ordered pair of measurements. For the first female, with the ordered pair (62, 123), we go to 62 on the x-axis (height), then up to 123 on the y-axis (weight).

9 The Scatter Diagram Do you think that the scatter diagram shows a relationship between a female’s height and her weight? Yes. Visual inspection of a scatter diagram can help to determine whether there is an apparent relationship (correlation) between the two variables and what type of relationship this is. There are 3 basic types of relationship we will encounter: positive correlation, negative correlation, and no linear correlation.

10 Positive correlation A positive correlation between two variables, x and y, occurs when high measurements for the x-variable tend to be associated with high measurements for the y-variable, and low measurements for the x-variable tend to be associated with low measurements for the y-variable. The female height/weight example is an example of positive correlation because high measurements of the x-variable (height) tend to be associated with high measurements of the y-variable (weight), and low measurements for height tend to be associated with low measurements for weight.

11 Positive correlation The appearance of positive correlation is one in which the points move up towards the right of the scatter diagram. If we approximate a line through the dots of the scatter diagram, we can see that they follow a straight-line path. A linear relationship has a graph that forms a line.

12 Negative correlation A negative correlation between two variables, x and y, occurs when high measurements for the x-variable tend to be associated with low measurements for the y-variable, and low measurements for the x-variable tend to be associated with high measurements for the y-variable.

13 No Linear Correlation No linear correlation means there is no linear relationship between the two variables. That is, high and low measurements for the two variables are not associated in any predictable straight line pattern.

14 Scatter Diagram on the TI 83/84 Put the x values into L1 on your calculator. (STAT->EDIT) Put the y-values into L2.

15 Scatter Diagram on the TI 83/84 Turn on STAT PLOT (2nd -> STAT PLOT) and make sure only one stat plot is on. Choose the scatter diagram from the Type menu. Xlist should be the list containing the x values. Ylist should be the list containing the y values.

16 Scatter Diagram on the TI 83/84 Clear out any data from Y=. Press Zoom -> 9: Zoom Stat. Do these variables, x and y, have a positive correlation, negative correlation, or no linear correlation? Positive correlation.

17 Example 4.2 Use the sample data to construct a scatter diagram on your calculator. Indicate the type of correlation, if any exists.

18 Example 4.2 As we examine the scatter diagram from left to right, the pattern of the points is going in a downward direction. As the values for x increase, the values for y decrease. The scatter diagram indicates a negative correlation.

19 The Coefficient of Linear Correlation What type of correlation is shown in these two scatter diagrams? Both show negative correlations. What is the difference between them? The negative correlation for the scatter diagram on the right is stronger than that on the left. The closer the points of the scatter diagram approximate a straight line, the stronger the linear correlation.

20 The Coefficient of Linear Correlation To measure how close the points on a scatter diagram come to forming a straight line, we use a formula for r, Pearson’s correlation coefficient (or just the correlation coefficient), which measures the strength of a linear relationship between two variables for a sample. In the formula, x represents the data values for the first variable, y represents the data values for the second variable, and n represents the number of pairs of data values.
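
A standard computational form of Pearson's r, written with the symbols x, y, and n defined above, is given here for reference. It is the usual textbook form and is assumed, not confirmed, to match the layout of the formula on the slide:

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[\,n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[\,n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
```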

21 The Coefficient of Linear Correlation The values for r can range from -1 to 1. A value of r=1 represents the strongest positive linear correlation possible. It indicates a perfect positive linear correlation. This means that all points of the scatter diagram will lie on a straight line which is sloping upward from left to right. A value of r= -1 represents the strongest negative linear correlation possible. It indicates a perfect negative linear correlation. This means that all points of the scatter diagram will lie on a straight line which is sloping downward from left to right. A value of r=0 represents no linear correlation between the two variables.
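
The three reference values of r can be illustrated with a short Python sketch. The function name `pearson_r` and the three small data sets are hypothetical, chosen only for illustration, and the function uses the standard raw-sum form of Pearson's formula rather than anything taken from the slides:

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient using the raw-sum (computational) form."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical data sets, chosen only to reach the three reference values.
rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])          # all points on an upward line
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])         # all points on a downward line
curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # all points on a parabola

print(round(rising, 2), round(falling, 2), round(curved, 2))
```

The parabola case is worth noting: y depends completely on x, yet r comes out as 0, because r only measures how well the points follow a straight line.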

22 The Coefficient of Linear Correlation [Example scatter diagrams for r = 1, r = -1, r = 0, r ≈ -0.98, and r ≈ 0.97]

23 Correlation Coefficient on the Calculator You can use the formula for r as shown on pages 177 through 179, but an equivalent method is to use LinReg on the calculator. Before starting, you must turn “Diagnostics On.” Once turned on, you won’t have to adjust this setting again unless you reset your calculator. 2nd -> Catalog -> D -> DiagnosticOn -> Enter -> Enter

24 Example 4.5 Use the sample data in the table to calculate the sample correlation coefficient, r. 1. Put the x values in L1 and the y values in L2. 2. Press STAT -> CALC -> 4: LinReg(ax+b) -> ENTER. 3. Enter the two lists separated by a comma: LinReg (L1, L2). 4. Press ENTER.

25 Example 4.5 The correlation coefficient, r, is -0.9692312624. Remember, the correlation coefficient is a number between -1 and 1 and represents how strong a linear relationship the two variables have. The closer the number is to 1, the stronger the positive linear relationship. The closer to -1, the stronger the negative linear relationship. r ≈ -0.97 is close to -1 and represents a strong negative linear relationship.

26 Real World Application Use the data in the table to calculate the correlation coefficient, r, to measure the strength of the relationship between the two variables. The data in the table is from 2005 and was gathered from the Earth Trends web site: http://earthtrends.wri.org/

27
Country         Average length of schooling (in years)   Life expectancy
Australia       20                                        80.4
Bolivia         14                                        63.9
Botswana        12                                        46.6
China           11                                        72.0
Ethiopia        6                                         50.7
Iraq            10                                        57.1
Mexico          13                                        74.9
India           10                                        62.9
Romania         14                                        71.3
Rwanda          9                                         43.4
South Africa    13                                        53.4
Spain           16                                        80.0
Sweden          16                                        80.1
United States   16                                        77.4

28 Real World Application Rounded to 2 decimal places, the correlation coefficient is r ≈ 0.76. What type of correlation is this? Because r is positive, this is a positive correlation. To get a better view of the data, look at the scatter diagram.
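
As a cross-check on the calculator result, the same coefficient can be computed in Python from the 14 data pairs in the table. This is a minimal sketch that uses the deviation-from-the-mean form of Pearson's r, which is algebraically equivalent to the raw-sum form; the variable names are illustrative only:

```python
from math import sqrt

# (average length of schooling in years, life expectancy) for the 14 countries
data = [
    (20, 80.4), (14, 63.9), (12, 46.6), (11, 72.0), (6, 50.7),
    (10, 57.1), (13, 74.9), (10, 62.9), (14, 71.3), (9, 43.4),
    (13, 53.4), (16, 80.0), (16, 80.1), (16, 77.4),
]
xs = [x for x, _ in data]
ys = [y for _, y in data]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Sums of products and squares of the deviations from each mean
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)

r = s_xy / sqrt(s_xx * s_yy)
print(round(r, 2))  # 0.76
```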

29 Real World Application We can see that the dots are moving upward as we look at this diagram from left to right, but it is not a perfect positive correlation because the dots do not form a straight line. Very, very rarely will real-world variables form a perfect linear relationship.

30 Linear Regression Analysis In the real world application, we saw that a positive linear correlation exists between a country’s average schooling length and life expectancy. What if we wanted to estimate a country’s life expectancy by simply knowing the average length of schooling? Knowing that there is a significant linear correlation from the sample data, we can create a line that “best fits” the sample data. Then we can use the line to estimate other values for countries not part of the sample. A linear model (equation of a line) can be developed to predict a value for the dependent variable (y) given a value for the independent variable (x).

31 Linear Regression Analysis For example, a strong positive correlation has been shown to exist between high school students’ standardized test results and success in the first year of college as measured by the students’ GPAs. By creating a linear model (equation of a line), we can predict the first-year college success of a student with a particular standardized test score.

32 Linear Regression Analysis Linear regression analysis provides us with a linear model (an equation) that can be used to predict the value of the y variable (college GPA) given the value of the x variable (standardized test scores). The predicted value for y may not be exactly correct, but it will be a “close” estimate. The line that is created by linear regression analysis is the “best fit” line: the line positioned as closely as possible to all the sample points. This line is called the regression line.

33 Linear Regression Analysis Regression line formula: y’ = ax + b, where y’ is the predicted value of y (the dependent variable) given the value of x (the independent variable). a and b are regression coefficients computed from the sample data; a is the slope of the line and b is its y-intercept.
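
For reference, the standard least-squares expressions for the slope a and intercept b of y’ = ax + b, with n the number of data pairs, are shown below. They are the usual textbook forms and are assumed, not confirmed, to match the formula on the slide:

```latex
a = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {n\sum x^{2} - \left(\sum x\right)^{2}},
\qquad
b = \frac{\sum y - a\sum x}{n}
```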

34 Real World Application The data in the following table is from 2005 and was gathered from the Earth Trends web site: http://earthtrends.wri.org/
Country         Average length of schooling (in years)   Life expectancy
Australia       20                                        80.4
Bolivia         14                                        63.9
Botswana        12                                        46.6
China           11                                        72.0
Ethiopia        6                                         50.7
Iraq            10                                        57.1
Mexico          13                                        74.9
India           10                                        62.9
Romania         14                                        71.3
Rwanda          9                                         43.4
South Africa    13                                        53.4
Spain           16                                        80.0
Sweden          16                                        80.1
United States   16                                        77.4

35 Real World Example We’ve shown that there is a positive correlation between the average length of schooling and life expectancy for a country’s population. The data pairs in the previous table represent data from a random sample of 14 countries. Use the sample to develop a regression line to predict the life expectancy given the average length of schooling of a country. Use this line to predict the life expectancy for a country whose average length of schooling is: (a) 15 years (b) 17 years. Graph the scatter plot and regression line together.

36 Real World Example Which variable, schooling years or life expectancy, do we want to predict? This is the dependent variable. We want to predict life expectancy. It makes sense to try to predict the life expectancy for a given length of schooling. Values to be predicted (dependent variable) = life expectancies = y Values given (independent variable) = length of schooling = x

37 Real World Example The formulas for finding a and b are lengthy. To avoid errors that can occur by doing the calculations by hand, we’ll use the calculator to find a and b. Put all the x values into L1 and all the y values into L2. STAT -> CALC -> 4: LinReg(ax+b) -> ENTER. Remember to always use the order LinReg x-list, y-list.

38 Real World Example The values for a and b that you get can be rounded to 2 decimal places: a ≈ 2.79 and b ≈ 29.45. The equation of the regression line, y’ = ax + b, becomes: y’ = 2.79x + 29.45
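
The calculator's values for a and b can be reproduced with a few lines of Python. This is a sketch that assumes the standard least-squares formulas for slope and intercept and uses the 14 data pairs from the table; the variable names are illustrative:

```python
# x = average length of schooling (years), y = life expectancy, from the table
schooling = [20, 14, 12, 11, 6, 10, 13, 10, 14, 9, 13, 16, 16, 16]
life_expectancy = [80.4, 63.9, 46.6, 72.0, 50.7, 57.1, 74.9,
                   62.9, 71.3, 43.4, 53.4, 80.0, 80.1, 77.4]

n = len(schooling)
sum_x = sum(schooling)
sum_y = sum(life_expectancy)
sum_xy = sum(x * y for x, y in zip(schooling, life_expectancy))
sum_x2 = sum(x * x for x in schooling)

# Least-squares slope (a) and intercept (b) of the line y' = ax + b
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - a * sum_x) / n

print(round(a, 2), round(b, 2))  # 2.79 29.45
```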

39 Real World Example For the regression line found, y’ = 2.79x + 29.45, predict the life expectancy for a country if the average length of schooling is: (a) 15 years (b) 17 years. (a) The average length of schooling is the x-variable, so we will substitute 15 for x in our equation: y’ = 2.79(15) + 29.45 = 71.3. The predicted life expectancy for a country where the average length of schooling is 15 years is about 71.3 years.

40 Real World Example (b) The average length of schooling is the x-variable, so we will substitute 17 for x in our equation: y’ = 2.79(17) + 29.45 = 76.88. The predicted life expectancy for a country where the average length of schooling is 17 years is about 76.9 years.
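
The two substitutions into y’ = 2.79x + 29.45 can be checked with a small Python sketch. The function name `predict_life_expectancy` is hypothetical, and the rounded coefficients are hard-coded as defaults:

```python
def predict_life_expectancy(schooling_years, a=2.79, b=29.45):
    """Evaluate the regression line y' = ax + b at a given schooling length."""
    return a * schooling_years + b


for years in (15, 17):
    predicted = predict_life_expectancy(years)
    print(f"{years} years of schooling -> about {predicted:.2f} years")
```

This gives 71.30 for 15 years and 76.88 for 17 years. Both inputs lie inside the range of schooling values in the sample (6 to 20 years), which is where a regression line is most trustworthy.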

41 Real World Example Use the regression line equation and the sample data pairs from the real world example to graph, on the same axes, the scatter diagram of the sample data and the regression line. Make sure all the settings for STAT PLOT are correct and are using the 2 lists used for this problem. You can check the scatter diagram first by pressing Zoom -> 9: Zoom Stat.

42 Real World Example Press “Y=” at the top left. Put in the equation 2.79x + 29.45. Press Graph at the top right. Notice that the regression line “best fits” the sample data.

43 The Coefficient of Determination We have shown that there is a positive linear correlation between the average length of schooling and life expectancy of a country’s population. But there are also other factors that influence the life expectancy that exist outside of our data. The degree of influence that one variable (schooling) has on another variable (life expectancy) can be found with a number called the coefficient of determination.

44 The Coefficient of Determination In other words, how much of an influence does average schooling length have on life expectancy? The answer to this question will be a percentage, visually shown here as the part of the pie chart in blue. The coefficient of determination measures the proportion of the variance of the dependent variable y that can be accounted for by the variance of the independent variable x. Simply put, how much does y (life expectancy) depend on x (average length of schooling)? We find the coefficient of determination by squaring the coefficient of correlation, r.

45 The Coefficient of Determination In our example, r = 0.76. The coefficient of determination: r² = (0.76)² ≈ 0.58. Expressed as a percentage, r² = 58%. To interpret the meaning of the coefficient of determination, we can form the following general explanation: ___% of the variability in (dependent variable y) can be accounted for by the variability in (independent variable x). 58% of the variability in a country’s life expectancy can be accounted for by the variability in the average length of schooling.
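
The arithmetic behind the coefficient of determination is short enough to verify directly. The Python sketch below starts from the rounded value r = 0.76, as the example does; the variable names are illustrative:

```python
r = 0.76                      # correlation coefficient, rounded to 2 decimals
r_squared = r ** 2            # coefficient of determination

explained_pct = round(r_squared * 100)   # share of variability accounted for by x
unexplained_pct = 100 - explained_pct    # share left to other factors

print(f"r^2 = {r_squared:.4f}")
print(f"explained: {explained_pct}%, unexplained: {unexplained_pct}%")
```

The result is r² = 0.5776, which rounds to 58% explained and leaves 42% unexplained.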

46 The Coefficient of Determination The coefficient of determination, r² = 58%, suggests that there are other reasons why a country’s life expectancy is a certain amount. Since the coefficient of determination is 58%, we may conclude that the remaining 42% of variability is due to other unexplained factors. The unexplained amount is out of the scope of the problem. We can just accept that there are other factors that contribute to the variable life expectancy.

47 A note of caution regarding the interpretation of correlation results Two variables may have a significant linear relationship, but it doesn’t imply that there is a cause-and-effect relationship. In other words, the presence of one variable does not (necessarily) cause the presence of the other variable. For example, the number of storks nesting in various European towns in the early 1900s and the number of human babies born in the same towns during this period have a very high correlation. However, we can’t conclude that an increase in storks will cause an increase in babies (or vice versa). A significant linear correlation should not be interpreted to mean that a change in one variable caused a change in the other variable. Rather, changes in one variable are accompanied by changes in the other variable.

