Correlation and regression II. Regression using transformations. 1.


Correlation and regression II. Regression using transformations. 1

2 Scatterplot Relationship between two continuous variables

3 Scatterplot Other examples

4 Example Variables measured on the same individuals are often related to each other. Imagine that 6 students are given a battery of tests by a vocational guidance counsellor, with the results shown in the following table.

5 Possible relationships positive correlation negative correlation no correlation

6 Describing linear relationship with a number: the coefficient of correlation (r). Also called the Pearson coefficient of correlation. Correlation is a numerical measure of the strength of a linear association. The formula for the coefficient of correlation treats x and y identically; there is no distinction between explanatory and response variable. Let us denote the two samples by x1, x2, …, xn and y1, y2, …, yn; the coefficient of correlation can be computed according to the following formula:
r = Σ(xi - x̄)(yi - ȳ) / √( Σ(xi - x̄)² · Σ(yi - ȳ)² )
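The Pearson formula can be evaluated directly. As a minimal illustration (a Python sketch, not part of the original slides; the sample data are made up):

```python
import math

def pearson_r(x, y):
    """Pearson coefficient of correlation of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# The formula treats x and y identically (no explanatory/response distinction):
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(pearson_r(x, y))                     # ~0.7746
print(pearson_r(x, y) == pearson_r(y, x))  # symmetric in x and y
```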

7 Karl Pearson Karl Pearson (27 March 1857 – 27 April 1936) established the discipline of mathematical statistics.

8 Properties of r The value of r is always between -1 and +1 (-1 ≤ r ≤ 1); either extreme indicates a perfect linear association. a) If r is near +1 or -1, we say that we have high correlation. b) If r = 1, there is perfect positive correlation; if r = -1, there is perfect negative correlation. c) A correlation of zero indicates the absence of linear association. When there is no tendency for the points to lie on a straight line, we say that there is no correlation (r = 0) or low correlation (r is near 0).

9 Calculated values of r Scatterplots: positive correlation, r = 0.9989; negative correlation; no correlation.

10 Scatterplot Other examples r=0.873 r=0.018

11 Correlation and causation A correlation between two variables does not show that one causes the other.

12 When is a correlation „high”? What is considered to be high correlation varies with the field of application.

13 Testing the significance of the coefficient of correlation The statistician must decide when a sample value of r is far enough from zero to be significant, that is, when it is sufficiently far from zero to reflect the correlation in the population.

14 Testing the significance of the coefficient of correlation The statistician must decide when a sample value of r is far enough from zero to be significant, that is, when it is sufficiently far from zero to reflect the correlation in the population. This test can be carried out by expressing the t statistic in terms of r. H0: ρ = 0 (Greek rho; the correlation coefficient in the population is 0) Ha: ρ ≠ 0 (the correlation coefficient in the population is not 0). Assumption: the two samples are drawn from a bivariate normal distribution. If H0 is true, then the following t-statistic has n-2 degrees of freedom:
t = r·√(n-2) / √(1-r²)

15 Bivariate normal distributions Scatterplots with ρ = 0 and ρ = 0.4.

16 Testing the significance of the coefficient of correlation The t-statistic t = r·√(n-2) / √(1-r²) has n-2 degrees of freedom. Decision using a statistical table: if |t| > t(α, n-2), the difference is significant at level α; we reject H0 and state that the population correlation coefficient is different from 0. If |t| < t(α, n-2), the difference is not significant at level α; we do not reject H0 and state that the population correlation coefficient is not different from 0. Decision using the p-value: if p < α, we reject H0 at level α and state that the population correlation coefficient is different from 0.
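The decision rule above can be sketched in a few lines of Python (my addition, not from the slides); `t_crit` must still be looked up in a t-table for the chosen α and df = n-2:

```python
import math

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r**2); df = n - 2 under H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def reject_h0(r, n, t_crit):
    """Two-sided decision: reject H0 when |t| exceeds the table value t(alpha, n-2)."""
    return abs(t_statistic(r, n)) > t_crit

# Illustrative (made-up) sample: r = 0.7 from n = 20 pairs;
# the two-sided 5% critical value for df = 18 is 2.101.
print(t_statistic(0.7, 20))       # ~4.16
print(reject_h0(0.7, 20, 2.101))  # True: the correlation is significant
```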

17 Example 1. The correlation coefficient between math skill and language skill was found to be r = 0.9989. Is it significantly different from 0? H0: the correlation coefficient in the population is 0, ρ = 0. Ha: the correlation coefficient in the population is different from 0. Let's compute the test statistic: t = 0.9989·√4 / √(1-0.9989²) = 42.6. Degrees of freedom: df = 6-2 = 4. The critical value in the table is t(0.05, 4) = 2.776. Because 42.6 > 2.776, we reject H0 and claim that there is a significant linear correlation between the two variables at the 5% level.
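The arithmetic of Example 1 can be checked directly (a sketch; r = 0.9989 and n = 6 are the values used in the slides):

```python
import math

r, n = 0.9989, 6
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(t)  # ~42.6, well above the 5% critical value 2.776 for df = 4
```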

18 Example 1, cont. p<0.05, the correlation is significantly different from 0 at 5% level

19 Example 2. The correlation coefficient between math skill and retailing skill was found to be negative. Is it significantly different from 0? H0: the correlation coefficient in the population is 0, ρ = 0. Ha: the correlation coefficient in the population is different from 0. Let's compute the test statistic: t = -53.42. Degrees of freedom: df = 6-2 = 4. The critical value in the table is t(0.05, 4) = 2.776. Because |-53.42| = 53.42 > 2.776, we reject H0 and claim that there is a significant linear correlation between the two variables at the 5% level.

20 Example 2., cont.

21 Example 3. The correlation coefficient between math skill and theater skill was computed. Is it significantly different from 0? H0: the correlation coefficient in the population is 0, ρ = 0. Ha: the correlation coefficient in the population is different from 0. Let's compute the test statistic. Degrees of freedom: df = 6-2 = 4. The critical value in the table is t(0.05, 4) = 2.776. Because |t| < 2.776, we do not reject H0 and claim that there is no significant linear correlation between the two variables at the 5% level.

22 Example 3., cont.

23 Significance of the correlation Other examples: r = 0.873, p < 0.0001; r = 0.018, p = 0.833.

24 Prediction based on linear correlation: the linear regression When the form of the relationship in a scatterplot is linear, we usually want to describe that linear form more precisely with numbers. We can rarely hope to find data values lined up perfectly, so we fit lines to scatterplots with a method that compromises among the data values. This method is called the method of least squares. The key to finding, understanding, and using least squares lines is an understanding of their failures to fit the data: the residuals.

25 Residuals, example 1.

26 Residuals, example 2.

27 Residuals, example 3.

28 Prediction based on linear correlation: the linear regression A straight line that best fits the data, y = a + bx, is called the regression line. Geometrical meaning of a and b: b is called the regression coefficient, the slope of the best-fitting (regression) line; a is the y-intercept of the regression line. The principle of finding the values a and b, given x1, x2, …, xn and y1, y2, …, yn: minimize the sum of squared residuals, i.e. Σ( yi - (a + bxi) )² → min.
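The least-squares principle has a closed-form solution, b = Σ(xi-x̄)(yi-ȳ)/Σ(xi-x̄)² and a = ȳ - b·x̄, which can be sketched as follows (illustrative Python, not from the lecture):

```python
def least_squares(x, y):
    """Return (a, b) minimizing the sum of squared residuals of y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

# Noise-free linear data recovers the line exactly:
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])  # data lie on y = 1 + 2x
print(a, b)  # ~1.0, ~2.0
```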

29 Residuals, example 3. For each observed point (xi, yi), the residual is the vertical distance yi - (b·xi + a) between the observed value and the fitted line.


31 Equation of the regression line for the data of Example 1: y = 1.016·x + 15.5; the slope of the line is 1.016. Prediction based on the equation: what is the predicted language score for a student having 400 points in math? y_predicted = 1.016·400 + 15.5 = 421.9
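Prediction with the fitted equation is just a substitution (a sketch; the coefficients 1.016 and 15.5 are from the slide):

```python
a, b = 15.5, 1.016   # intercept and slope of the fitted line from Example 1
x_new = 400          # math score of the new student
y_pred = a + b * x_new
print(y_pred)        # predicted language score, ~421.9
```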

32 Computation of the correlation coefficient from the regression coefficient. There is a relationship between the correlation and the regression coefficient: b = r·(s_y / s_x), where s_x and s_y are the standard deviations of the samples. From this relationship it can be seen that the sign of r and b is the same: if there is a negative correlation between the variables, the slope of the regression line is also negative.
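The relationship b = r·s_y/s_x can be verified numerically (my sketch with made-up data; `statistics.stdev` is the sample standard deviation):

```python
import statistics

def slope_from_r(r, x, y):
    """Regression slope from the correlation: b = r * s_y / s_x."""
    return r * statistics.stdev(y) / statistics.stdev(x)

# For perfectly linear data (r = 1) the recovered slope matches the line:
x = [1, 2, 3, 4]
y = [3 + 2 * xi for xi in x]    # y = 3 + 2x, so r = 1 and b = 2
print(slope_from_r(1.0, x, y))  # ~2.0; a negative r would give a negative slope
```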

33 Hypothesis tests for the parameters of the regression line Does y really depend on x? (not only in the sample, but also in the population?) Assumption: the two samples are drawn from a bivariate normal distribution. One possible method: t-test for the slope of the regression line. H0: b_pop = 0, the slope of the line in the population is 0 (horizontal line). Ha: b_pop ≠ 0. If H0 is true, then the statistic t = b/SE(b) follows a t-distribution with n-2 degrees of freedom.

34 Hypothesis tests for the parameters of the regression line Does y really depend on x? (not only in the sample, but also in the population?) Another possible (equivalent) method: the F-test for the regression, i.e. analysis of variance for the regression. Denote the estimated value ŷi = a + bxi. The following decomposition is true: total variation of y = variation due to its dependence on x + residual variation, that is, SStot = SSx + SSe.

35 Analysis of variance for the regression

Source of variation | Sum of squares | Degrees of freedom | Variance     | F
Regression          | SSx            | 1                  | SSx/1        | (SSx/1) / (SSe/(n-2))
Residual            | SSe            | n-2                | SSe/(n-2)    |
Total               | SStot          | n-1                |              |

F has two degrees of freedom: 1 and n-2. This test is equivalent to the t-test of the slope of the line, and to the t-test of the coefficient of correlation (they have the same p-value).
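The decomposition and the F-statistic can be sketched in Python (illustrative, with made-up data; not the lecture's SPSS output). The code also checks that the total sum of squares splits exactly into the regression and residual parts:

```python
def regression_anova(x, y):
    """Return (ss_tot, ss_reg, ss_res, F) for simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    y_hat = [a + b * xi for xi in x]                       # fitted values
    ss_tot = sum((yi - my) ** 2 for yi in y)               # total variation
    ss_reg = sum((yh - my) ** 2 for yh in y_hat)           # due to x
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual
    f_stat = (ss_reg / 1) / (ss_res / (n - 2))
    return ss_tot, ss_reg, ss_res, f_stat

ss_tot, ss_reg, ss_res, f_stat = regression_anova([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(ss_tot, ss_reg + ss_res)  # the decomposition SStot = SSreg + SSres holds
print(f_stat)                   # equals t**2 for the slope test (same p-value)
```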

36 SPSS output for the relationship between age and body mass Equation of the regression line: y = 0.078x. Coefficient of correlation: r = 0.018. Significance of the regression (= significance of the correlation): p = 0.833. Significance of the slope (= significance of the correlation): p = 0.833. Significance of the constant term: p < 0.0001.

37 SPSS output for the relationship between body mass 3 years ago and at present Coefficient of correlation: r = 0.873. Equation of the regression line: y = 0.96x. Significance of the regression (= significance of the correlation): p < 0.0001.

38 Coefficient of determination The square of the correlation coefficient multiplied by 100 is called the coefficient of determination. It shows the percentage of the total variation explained by the linear regression. Example: the correlation between math aptitude and language aptitude was found to be r = 0.9989. The coefficient of determination is r² = 0.9978, so 99.78% of the total variation of Y is explained by its linear relationship with X. r² from the ANOVA table: r² = Regression SS / Total SS.
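The identity r² = Regression SS / Total SS can be checked numerically (my sketch with made-up data, not from the slides):

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

r2 = sxy ** 2 / (sxx * syy)   # square of the correlation coefficient
b = sxy / sxx                 # regression slope
ss_reg = b ** 2 * sxx         # regression sum of squares
print(r2, ss_reg / syy)       # the two computations agree
```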

39 Regression using transformations Sometimes, useful models are not linear in parameters. Examining the scatterplot of the data shows a functional, but not linear relationship between data.

40 Example A fast food chain was opened, and each year from 1974 to 1988 the number of steakhouses in operation was recorded. The scatterplot of the original data suggests an exponential relationship between x (year) and y (number of steakhouses) (first plot). Taking the logarithm of y, we get a linear relationship (plot at the bottom).

41 Performing the linear regression procedure on x and log(y), we get a fitted line of the form log y = a + bx, that is, y = e^a · e^(bx) = 1.293 · e^(bx) (so e^a = 1.293), which is the equation of the best fitting curve to the original data.
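The transformation-based fit can be sketched in Python (illustrative, not the lecture's data; the made-up points below are exactly exponential, so the parameters are recovered):

```python
import math

def fit_exponential(x, y):
    """Fit y = A * exp(B * x) by least squares on (x, ln y); requires y > 0."""
    ly = [math.log(yi) for yi in y]          # linearize: ln y = ln A + B*x
    n = len(x)
    mx, ml = sum(x) / n, sum(ly) / n
    B = sum((xi - mx) * (li - ml) for xi, li in zip(x, ly)) / \
        sum((xi - mx) ** 2 for xi in x)
    A = math.exp(ml - B * mx)                # back-transform the intercept
    return A, B

x = [0, 1, 2, 3, 4]
y = [2.0 * math.exp(0.5 * xi) for xi in x]   # data on y = 2 * e^(0.5x)
A, B = fit_exponential(x, y)
print(A, B)  # ~2.0, ~0.5
```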

42 Fitted line: log y = a + bx; back-transformed curve: y = 1.293 · e^(bx).

43 Types of transformations Some non-linear models can be transformed into a linear model by taking logarithms on one or both sides. Either the base-10 logarithm (denoted lg or log) or the natural (base e) logarithm (denoted ln) can be used. If a > 0 and b > 0, applying a logarithmic transformation turns the model into one that is linear in the transformed variables; the following slides list the common cases.

44 Exponential relationship -> take log y Model: y = a·10^(bx). Take the logarithm of both sides: lg y = lg a + b·x, so lg y is linear in x.

45 Logarithmic relationship -> take log x Model: y = a + b·lg x, so y is linear in lg x.

46 Power relationship -> take log x and log y Model: y = a·x^b. Take the logarithm of both sides: lg y = lg a + b·lg x, so lg y is linear in lg x.

47 Base-10 logarithmic scale

48 Logarithmic papers: semilogarithmic paper, log-log paper

49 Reciprocal relationship -> take reciprocal of x Model: y = a + b/x = a + b·(1/x), so y is linear in 1/x.
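All the linearizing transformations above follow the same recipe: transform, fit a straight line, transform back. A sketch for the reciprocal case (made-up data, not from the lecture):

```python
def fit_reciprocal(x, y):
    """Fit y = a + b / x by ordinary least squares on (1/x, y)."""
    u = [1.0 / xi for xi in x]      # linearize: y is linear in u = 1/x
    n = len(u)
    mu, my = sum(u) / n, sum(y) / n
    b = sum((ui - mu) * (yi - my) for ui, yi in zip(u, y)) / \
        sum((ui - mu) ** 2 for ui in u)
    a = my - b * mu
    return a, b

x = [1, 2, 4, 5, 10]
y = [3 + 10 / xi for xi in x]   # data exactly on y = 3 + 10/x
a, b = fit_reciprocal(x, y)
print(a, b)  # ~3.0, ~10.0
```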

50 Example from the literature


52 Example 2. El Hadj Othmane Taha et al.: Osteoprotegerin: the regulator, the protector and the marker [in Hungarian]. Orvosi Hetilap 2008, Vol. 149, No. 42, pp. 1971–1980.

53 Useful WEB pages

54 Review questions and problems Graphical examination of the relationship between two continuous variables (scatterplot) The meaning and properties of the coefficient of correlation Coefficient of correlation and linearity The significance of correlation: null hypothesis, t-value, degrees of freedom, decision The coefficient of determination The meaning of the regression line and its coefficients The principle of finding the equation of the regression line Hypothesis test for the regression Regression using transformations.

55 Problems Based on n = 5 pairs of observations, the coefficient of correlation was calculated; its value is r = 0.7. Is the correlation significant at the 5% level? Null and alternative hypothesis: …………………. t-value of the correlation: …………………. Degrees of freedom: …………………. Decision (critical value in the table: t(3, 0.05) = 3.182): ………………….. On the physics practicals the waist circumference was measured. The measurement was repeated three times. The relationship of the first two measurements was examined by linear regression. Interpret the results below (coefficient of correlation, coefficient of determination, significance of the correlation: null hypothesis, t-value, p-value, the equation of the regression line).

56 The origin of the word „regression”. Galton: Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute, 1886, Vol. 15.
