# Correlation and Regression

## Presentation transcript

Correlation and Regression

Correlation

What type of relationship exists between the two variables, and is the correlation significant? Correlation is a quantitative relationship between two interval- or ratio-level variables.

| Explanatory (independent) variable, x | Response (dependent) variable, y |
| --- | --- |
| Hours of training | Number of accidents |
| Shoe size | Height |
| Cigarettes smoked per day | Lung capacity |
| Score on SAT | Grade point average |
| Height | IQ |

Correlation measures and describes the strength and direction of the relationship.

- Bivariate techniques require two variable scores from the same individuals (dependent and independent variables).
- Multivariate techniques are used when there is more than one independent variable (e.g., the effect of advertising and prices on sales).
- Variables must be on a ratio or interval scale.

Scatter Plots and Types of Correlation

Negative correlation: as x increases, y decreases. Here x = hours of training (horizontal axis) and y = number of accidents (vertical axis).

[Figure: scatter plot of Accidents versus Hours of Training]

Scatter Plots and Types of Correlation

Positive correlation: as x increases, y increases. Here x = SAT score and y = GPA.

[Figure: scatter plot of GPA versus Math SAT]

Scatter Plots and Types of Correlation

No linear correlation. Here x = height and y = IQ.

[Figure: scatter plot of IQ versus Height]

Scatter Plots and Types of Correlation

A strong, negative relationship, but non-linear!

Correlation Coefficient

A measure of the strength and direction of a linear relationship between two variables. The range of r is from –1 to 1.

- If r is close to 1, there is a strong positive correlation.
- If r is close to –1, there is a strong negative correlation.
- If r is close to 0, there is no linear correlation.

[Figure: number line for r running from –1 through 0 to 1]
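Because r only measures linear association, a perfectly predictable but curved relationship can still give r = 0. The following is a minimal Python sketch of that point. The data are made up for illustration, and `pearson_r` is a hypothetical helper that uses the standard computational formula for r, which the transcript itself does not show.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r via the computational (sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Illustrative data: y is completely determined by x, but along a parabola.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r_curved = pearson_r(xs, ys)
print(f"r for y = x^2 on symmetric x values: {r_curved:.3f}")
```

A scatter plot would reveal the pattern immediately, which is why the plots are checked before r is interpreted.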

Outliers

Outliers are dangerous. Here we have a spurious correlation of r = 0.68. Without IBM, r = 0.48; without IBM and GE, r = 0.21.
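The effect can be reproduced with a small Python sketch. The numbers below are hypothetical and are not the stock data behind the slide. `pearson_r` is an illustrative helper that uses the standard computational formula for r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r via the computational (sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data: five points with almost no linear trend...
base_x = [1, 2, 3, 4, 5]
base_y = [3, 1, 4, 2, 3]
# ...plus one extreme point far away from the rest.
outlier_x, outlier_y = 20, 20

r_without = pearson_r(base_x, base_y)
r_with = pearson_r(base_x + [outlier_x], base_y + [outlier_y])

print(f"r without the outlier: {r_without:.2f}")
print(f"r with the outlier:    {r_with:.2f}")
```

In this sketch a single point moves r from a weak value to a very strong one, which is the same pattern the slide reports in reverse when IBM and GE are removed.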

Application

| Absences (x) | Final grade (y) |
| --- | --- |
| 8 | 78 |
| 2 | 92 |
| 5 | 90 |
| 12 | 58 |
| 15 | 43 |
| 9 | 74 |
| 6 | 81 |

[Figure: scatter plot of Final Grade versus Absences]

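Computation of r

| | x | y | xy | x² | y² |
| --- | --- | --- | --- | --- | --- |
| 1 | 8 | 78 | 624 | 64 | 6084 |
| 2 | 2 | 92 | 184 | 4 | 8464 |
| 3 | 5 | 90 | 450 | 25 | 8100 |
| 4 | 12 | 58 | 696 | 144 | 3364 |
| 5 | 15 | 43 | 645 | 225 | 1849 |
| 6 | 9 | 74 | 666 | 81 | 5476 |
| 7 | 6 | 81 | 486 | 36 | 6561 |
| Total | 57 | 516 | 3751 | 579 | 39898 |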
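The column totals feed directly into r. The sketch below assumes the usual computational formula, $r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$, which matches the xy, x², and y² columns but is not printed in the transcript. Variable names are illustrative.

```python
from math import sqrt

# Data from the absences / final grade table.
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]

n = len(absences)
sum_x = sum(absences)                                 # 57
sum_y = sum(grades)                                   # 516
sum_xy = sum(x * y for x, y in zip(absences, grades)) # 3751
sum_x2 = sum(x * x for x in absences)                 # 579
sum_y2 = sum(y * y for y in grades)                   # 39898

numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator

print(f"r = {r:.3f}")
```

The result rounds to –0.975, the value used in the significance test for this data set.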

Hypothesis Test for Significance

r is the correlation coefficient for the sample. The correlation coefficient for the population is $\rho$ (rho).

For a two-tailed test for significance:

- $H_0\colon \rho = 0$ (The correlation is not significant)
- $H_a\colon \rho \neq 0$ (The correlation is significant)

The sampling distribution for r is a t-distribution with n – 2 d.f.

Standardized test statistic:

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

Test of Significance

The correlation between the number of times absent and the final grade is r = –0.975. There were seven pairs of data. Test the significance of this correlation. Use $\alpha$ = 0.01.

1. Write the null and alternative hypotheses: $H_0\colon \rho = 0$ (The correlation is not significant) and $H_a\colon \rho \neq 0$ (The correlation is significant).
2. State the level of significance: $\alpha$ = 0.01.
3. Identify the sampling distribution: a t-distribution with 7 – 2 = 5 degrees of freedom.

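4. Find the critical value.
5. Find the rejection region.
6. Find the test statistic.

The table lists t values by degrees of freedom (df) and one-tail area p. For a two-tailed test with $\alpha$ = 0.01, use the p = 0.005 column. With 5 d.f. the critical values are ±t0 = ±4.032, so the rejection regions are t < –4.032 and t > 4.032.

| df \ p | 0.40 | 0.25 | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 | 0.0005 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.324920 | 1.000000 | 3.077684 | 6.313752 | 12.70620 | 31.82052 | 63.65674 | 636.6192 |
| 2 | 0.288675 | 0.816497 | 1.885618 | 2.919986 | 4.30265 | 6.96456 | 9.92484 | 31.5991 |
| 3 | 0.276671 | 0.764892 | 1.637744 | 2.353363 | 3.18245 | 4.54070 | 5.84091 | 12.9240 |
| 4 | 0.270722 | 0.740697 | 1.533206 | 2.131847 | 2.77645 | 3.74695 | 4.60409 | 8.6103 |
| 5 | 0.267181 | 0.726687 | 1.475884 | 2.015048 | 2.57058 | 3.36493 | 4.03214 | 6.8688 |

[Figure: t-distribution curve with rejection regions beyond the critical values –4.032 and 4.032]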

7. Make your decision. t = –9.811 falls in the rejection region. Reject the null hypothesis.
8. Interpret your decision. There is a significant negative correlation between the number of times absent and final grades.

[Figure: t-distribution curve with the test statistic t = –9.811 to the left of the critical value –4.032]
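Steps 6 and 7 can be checked with a short Python sketch. It assumes the standard test statistic for a correlation coefficient, $t = r / \sqrt{(1 - r^2)/(n - 2)}$, and takes the critical value 4.032 from the t-table rather than computing it. Variable names are illustrative.

```python
from math import sqrt

r = -0.975              # sample correlation from the example
n = 7                   # number of data pairs
critical_value = 4.032  # two-tailed, alpha = 0.01, 5 d.f. (read from the t-table)

df = n - 2
standard_error = sqrt((1 - r ** 2) / df)
t = r / standard_error

# Two-tailed decision rule: reject H0 when |t| exceeds the critical value.
reject_null = abs(t) > critical_value

print(f"df = {df}, t = {t:.2f}, reject H0: {reject_null}")
```

The computed statistic is about –9.81, far beyond –4.032, so the sketch reaches the same decision as the slide.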

The Line of Regression

Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least squares line. Regression indicates the degree to which the variation in one variable, y, is related to or can be explained by the variation in another variable, x.

The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.

The line of regression is:

$$\hat{y} = mx + b$$

The slope m is:

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

The y-intercept is:

$$b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n}$$

Best fitting straight line

[Figure: scatter plot of revenue versus Ad \$ with the best-fitting straight line. Legend: $(x_i, y_i)$ is a data point; each data point has a matching point on the line with the same x-value; the vertical gap between the two is a residual.]

Calculate m and b. Write the equation of the line of regression with x = number of absences and y = final grade.

From the computation table: n = 7, $\sum x = 57$, $\sum y = 516$, $\sum xy = 3751$, $\sum x^2 = 579$, $\sum y^2 = 39898$.

$$m = \frac{7(3751) - (57)(516)}{7(579) - 57^2} = \frac{-3155}{804} \approx -3.924$$

$$b = \bar{y} - m\bar{x} \approx 73.714 - (-3.924)(8.143) \approx 105.667$$

The line of regression is: $\hat{y} = -3.924x + 105.667$

The Line of Regression

m = –3.924 and b = 105.667. The line of regression is: $\hat{y} = -3.924x + 105.667$

Note that the point $(\bar{x}, \bar{y})$ = (8.143, 73.714) is on the line.

[Figure: scatter plot of Final Grade versus Absences with the regression line drawn through the data]
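The slope, the intercept, and the claim about the mean point can be verified from the column totals. This is a minimal Python sketch assuming the standard least-squares formulas $m = (n\sum xy - \sum x \sum y)/(n\sum x^2 - (\sum x)^2)$ and $b = \bar{y} - m\bar{x}$. Variable names are illustrative.

```python
# Column totals from the absences / final grade computation table.
n = 7
sum_x, sum_y = 57, 516
sum_xy, sum_x2 = 3751, 579

# Least-squares slope and intercept.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
x_bar = sum_x / n
y_bar = sum_y / n
b = y_bar - m * x_bar

# The regression line always passes through the point of means.
y_hat_at_mean = m * x_bar + b

print(f"m = {m:.3f}, b = {b:.3f}")
print(f"mean point = ({x_bar:.3f}, {y_bar:.3f}), line at x_bar = {y_hat_at_mean:.3f}")
```

Carrying full precision gives an intercept that agrees with the slide's 105.667 to within rounding; the slide's value comes from using the rounded slope –3.924.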

Predicting y Values

The regression line can be used to predict values of y for values of x falling within the range of the data. The regression equation for number of times absent and final grade is:

$$\hat{y} = -3.924x + 105.667$$

Use this equation to predict the expected grade for a student with (a) 3 absences and (b) 12 absences.

(a) $\hat{y} = -3.924(3) + 105.667 = 93.895$

(b) $\hat{y} = -3.924(12) + 105.667 = 58.579$
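A small Python sketch of the prediction step follows. The function name `predict_grade` and the choice to raise an error outside the observed range are illustrative assumptions. The range 2 to 15 is the smallest and largest number of absences in the data set.

```python
SLOPE = -3.924       # m from the regression equation
INTERCEPT = 105.667  # b from the regression equation
X_MIN, X_MAX = 2, 15 # observed range of absences in the sample

def predict_grade(absences):
    """Predict a final grade, refusing to extrapolate beyond the data."""
    if not X_MIN <= absences <= X_MAX:
        raise ValueError(
            f"{absences} absences is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; the line should not be used there"
        )
    return SLOPE * absences + INTERCEPT

grade_3 = predict_grade(3)
grade_12 = predict_grade(12)

print(f"3 absences:  {grade_3:.3f}")
print(f"12 absences: {grade_12:.3f}")
```

Refusing to extrapolate is one way to enforce the "within the range of the data" condition. A gentler design might return the prediction together with a warning instead.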

Strength of the Association

The coefficient of determination, $r^2$, measures the strength of the association and is the ratio of explained variation in y to the total variation in y.

The correlation coefficient of number of times absent and final grade is r = –0.975. The coefficient of determination is $r^2 = (-0.975)^2 = 0.9506$.

Interpretation: About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or other variables such as intelligence, amount of time studied, etc.
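The "ratio of explained variation to total variation" reading can be checked numerically. The sketch below fits the least-squares line to the absences and final grade data and splits the variation in y into explained and unexplained parts. Variable names are illustrative, and the deviation-form slope used here is algebraically the same as the sums formula.

```python
# Data from the absences / final grade example.
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]

n = len(absences)
x_bar = sum(absences) / n
y_bar = sum(grades) / n

# Least-squares line in deviation form.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(absences, grades))
s_xx = sum((x - x_bar) ** 2 for x in absences)
m = s_xy / s_xx
b = y_bar - m * x_bar
predicted = [m * x + b for x in absences]

# Partition the variation in y.
total = sum((y - y_bar) ** 2 for y in grades)
explained = sum((p - y_bar) ** 2 for p in predicted)
unexplained = sum((y - p) ** 2 for y, p in zip(grades, predicted))

r_squared = explained / total
print(f"explained / total = {r_squared:.4f}")
print(f"unexplained share = {unexplained / total:.4f}")
```

With unrounded arithmetic the ratio comes out at about 0.950, slightly below the slide's 0.9506, which squares the rounded value r = –0.975. Either way, roughly 95% of the variation is explained.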
