Presentation on theme: "Correlation Coefficient & Simple Linear Regression"— Presentation transcript:
1 Correlation Coefficient & Simple Linear Regression. Association does not imply causation. Correlation does not assume causality but regression does. STATS 101. Laurens Holmes, Jr.
2 Sir Francis Galton. Why are statistical methods for predicting a response from an explanatory variable termed "regression"? Regression implies "to go backward". The correlation coefficient is also attributed to Francis Galton, who was the first to apply the word regression to biological and psychological data. Specifically, Galton compared the heights of children with the heights of their parents. He found that taller-than-average parents tended to have children who were also taller than average, but not as tall as their parents. Galton characterized this as regression toward mediocrity.
3 Correlation r: Linear relationships, which imply a straight-line association, are visualized with scatter plots. The linear relationship is strong when the points lie close to a straight line and weak when they are widely scattered.
4 Correlation r: Purpose: measures the direction and strength of the linear relationship between two quantitative variables. It is represented by r. There is no assumption of causality. It assumes a linear association between the two variables.
5 Correlation r: r measures only a straight-line relationship. Formula: $r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$. Vignette: Suppose the heights of 64 children with OI in our sample are denoted by x and their weights by y, with n = 64 (sample size). The values for patient 1 are x1 and y1, for patient 2 are x2 and y2, and so on up to patient 64. The mean and SD are $\bar{x}$ and $s_x$ for height and $\bar{y}$ and $s_y$ for weight. What is r?
6 Interpretations: $(x_i-\bar{x})/s_x$ is the standardized height of patient i, computed from the mean and SD of the heights (in centimeters) of the OI patients. It states how many SDs above or below the mean the patient's height lies. Standardized values have no units. r is essentially the average (with divisor n − 1) of the products of the standardized height and standardized weight of the n patients with OI.
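The interpretation above can be checked numerically. The following is a minimal Python sketch that standardizes each variable and averages the products with divisor n − 1. The heights and weights are made up for illustration (they are not the OI data), and the name pearson_r is hypothetical.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson r as the sum of products of standardized values over n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, n >= 2")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample SDs (divisor n - 1)
    z_x = [(x - x_bar) / s_x for x in xs]  # standardized heights, unit-free
    z_y = [(y - y_bar) / s_y for y in ys]  # standardized weights, unit-free
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)


# Made-up illustrative data, not taken from the slides.
heights_cm = [100.0, 110.0, 120.0, 130.0, 140.0]
weights_kg = [16.0, 19.0, 21.0, 26.0, 28.0]
r = pearson_r(heights_cm, weights_kg)
print(f"r = {r:.3f}")
```

Because each term is a product of unit-free standardized values, r itself has no units.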
7 Vignette: The next slide shows the hypothetical systolic BP (SBP) and age, and the hypothetical weight and age, of twenty children with CP in a sample at the no-city hospital. Computing the correlation, is there a relationship between SBP and age, and between weight and age, in this sample? What do you see in the scatter plot? What is the interpretation of your finding?
8 Table 1. BP and Age of Children with CP. Weight (kg) and Age: 3812.54512.13513.65010.06011.212.03013.45113.85316.84015.64312.3394112.713.75612.852111.66214.013.04411.9. SBP and Age: 9012.58812.110013.67010.08011.212.013.410213.812016.811015.68912.312.713.7879312.88211.614.013.08611.9
9 Correlation r – basic assumptions: No distinction is made between the explanatory (x) and response (y) variables. The null hypothesis is that the population correlation is zero; the test assesses whether r is significantly different from zero (0). Both variables must be quantitative (continuous). Both variables must be normally distributed. If one or both are not, either transform the variables to near normality or use the non-parametric alternative, the Spearman correlation coefficient, which makes no assumption about the shape of the distribution (distribution-free).
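The slides do not say which statistic is used to test whether r differs from zero. A common choice is t = r·sqrt(n − 2)/sqrt(1 − r²), referred to a t distribution with n − 2 degrees of freedom. The Python sketch below assumes that choice. The function name and the example values of r and n are hypothetical, and the p-value lookup is left out because the standard library has no t distribution.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: population correlation = 0, with n - 2 df."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Hypothetical example: r = 0.5 observed in a sample of n = 27.
t_example = correlation_t_statistic(0.5, 27)
print(f"t = {t_example:.3f} on {27 - 2} degrees of freedom")
```

For a fixed r, the statistic grows with n, so the same correlation can be non-significant in a small sample and significant in a large one.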
10 Correlation r – basic assumptions: No categorical or nominal variables. r does not change when we change the units of measurement, for example from kg to pounds for weight. Why? Because r uses standardized values of the observations. r does not measure or describe curved (non-linear) association, no matter how strong. Like the mean and SD, r is not resistant: it is strongly affected by outlying observations.
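Two of the properties above, invariance to units and sensitivity to outliers, can be illustrated with a short Python sketch. The ages and weights are made up, the conversion factor 2.20462 lb per kg is the usual approximation, and pearson_r is a hypothetical helper name.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Made-up illustrative data, not taken from the slides.
ages = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
weights_kg = [30.0, 34.0, 35.0, 40.0, 43.0, 47.0]
weights_lb = [w * 2.20462 for w in weights_kg]  # same data in pounds

r_kg = pearson_r(ages, weights_kg)
r_lb = pearson_r(ages, weights_lb)  # rescaling leaves r unchanged

# Add one outlying child: older but much lighter than the trend suggests.
r_outlier = pearson_r(ages + [16.0], weights_kg + [20.0])

print(f"kg: {r_kg:.3f}  lb: {r_lb:.3f}  with outlier: {r_outlier:.3f}")
```

A single outlying point is enough to pull a strong correlation toward zero, which is why the scatter plot should be inspected before r is interpreted.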
11 Figure 1. Scatter plot of the relationship between SBP and age of children with CP (hypothetical data)
12 Normality test: weight, age, SBP, age. Stata Skewness/Kurtosis tests for Normality (columns: Variable, Pr(Skewness), Pr(Kurtosis), adj chi2(2), Prob>chi2), run for weightkg and age, and for spb and age.
13 STATA Output – Correlation coefficient (Pearson): pwcorr spb age, obs sig star(5) (correlation matrix of spb and age). Non-significant correlation does not imply no association.
14 Scatter plot of the relationship between weight and age of children with CP (hypothetical data)
15 What is the correct stats technique? STATA Output – Correlation coefficient (Pearson) versus Spearman Rank Correlation. Pearson: pwcorr weight age, obs sig star(5) (correlation matrix of weight and age). Spearman: spearman weightkg age, stats(rho obs p) star(0.05) (reports Number of obs, Spearman's rho, and Prob > |t| for the test of Ho: weightkg and age are independent).
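To make the Pearson versus Spearman contrast concrete: Spearman's rho is the Pearson correlation computed on the ranks of the data, with tied values given their average rank. The Python sketch below uses made-up data that are perfectly monotone but not linear. The helper names are hypothetical, and this is an illustration of the idea, not a reproduction of the Stata output.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross-products."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Made-up data: y increases with x along a curve (y = x cubed).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [v ** 3 for v in x]
r_pearson = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {rho:.3f}")
```

Here rho reaches 1 because the ordering of y matches the ordering of x exactly, while Pearson r stays below 1 because the points do not lie on a straight line.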
16 Correlation r - Interpretation: A positive r indicates a positive linear association between the variables x and y, and a negative r indicates a negative linear relationship. r is always between -1 and +1. The strength increases as r moves away from zero toward either -1 or +1. The extreme values +1 and -1 indicate a perfect linear relationship (points lie exactly along a straight line). Graded interpretation: r = ... weak; ... moderate; ... strong correlation.
17 Vignette: Suppose there is a linear relationship between the age of CP patients and their SBP in sample data with 66 patients. Examine this relationship and interpret your results.
25 Interpretation: In a sample of 66 children with CP, there is no significant relationship between the age of the children and systolic BP, r = 0.02, p = 0.90. Assuming a non-normal distribution of either variable, a non-parametric test (Spearman rank correlation) was also used, r = ..., p = 0.84. In either test, there is no significant linear relationship between age at surgery and the SBP of these patients. However, the absence of a linear association does not rule out a non-linear relationship between the age of these patients and their SBP.
26 Simple Linear Regression. Stats 101. SLR is not a measure of association but of linear relationship. Absence of a significant association in SLR does not imply absence of a non-linear association.
27 Regression Model: A statistical technique for assessing the relationship between a dependent variable and one or more independent variables. The relationship between two variables is characterized by how they vary together. Given pairs of X and Y values, regression analysis measures the direction (positive or negative) and the rate of change in Y as X changes (the slope).
28 Regression Model: Adequate for predicting the value of Y, given X. Inappropriate for assessing the strength of an association between two or more variables. Causal association is assumed.
29 Simple regression model: The regression equation and line represent the simple linear equation and describe the shape of the relationship between the variables. The regression line is the line drawn through the scatter plot; how closely it fits the data, summarized by the coefficient of determination, indicates the fitness of the regression model.
30 Basic Assumptions: Linearity: the relationship between Y and X is linear (straight-line relationship). Residuals are independent and normally distributed. Homoscedasticity: the variance of the residuals is equal for all X. There is no measurement error in X (an impractical assumption); measurement error below 10% is assumed adequate.
31 Basics of SLR: Different values of x will produce different mean values of y: μy = β0 + β1x. The means all lie on a straight line. For any fixed value of x, y varies according to a normal distribution. These normal distributions all have the same standard deviation. The explanatory variable x can take many values.
32 Basics of simple linear regression: All means lie on a line when plotted against x. The equation of the line is μy = β0 + β1x, with intercept β0 and slope β1. The population regression line describes how the mean response changes with x. The response y to a given x is a random variable that can take different values if we have several observations with the same x-value.
33 Simple linear regression model: The population regression line connects the mean of y with x in the population. The slope β1 is the mean change in y for a one-unit increase in x. The intercept β0 is the mean of y when x = 0. DATA = FIT + RESIDUAL. The RESIDUAL represents deviations of the data from the line of population means. The model takes these deviations to be normally distributed with standard deviation σ. ϵ represents the residual part of the statistical model: y is the sum of its mean and a chance deviation ϵ from the mean. The deviations ϵ represent the noise, that is, the variation in y due to other causes that prevents the observed (x, y) values from forming a perfect straight line on a scatterplot.
34 Simple linear regression model: The data are n observations on an explanatory variable x and a response variable y: (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn). The statistical model for SLR states that the observed response yi when the explanatory variable takes the value xi is yi = β0 + β1xi + ϵi, where μy = β0 + β1xi is the mean response when x = xi. The deviations ϵi are independent and normally distributed with mean 0 and SD σ. The parameters of the model are the intercept and slope of the population regression line and the variability (σ) of the response y about the line.
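The model on this slide can be simulated directly, which makes the DATA = FIT + RESIDUAL split visible. The Python sketch below draws x values, adds independent normal deviations with mean 0 and SD σ, and builds each response as intercept plus slope times x plus deviation. All parameter values, the uniform range for x, and the name simulate_slr are assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev


def simulate_slr(n, beta0, beta1, sigma, seed=0):
    """Draw n observations from y_i = beta0 + beta1 * x_i + eps_i."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    eps = [rng.gauss(0.0, sigma) for _ in range(n)]  # chance deviations
    ys = [beta0 + beta1 * x + e for x, e in zip(xs, eps)]  # FIT + RESIDUAL
    return xs, ys, eps


# Hypothetical population parameters, not taken from the slides.
xs, ys, eps = simulate_slr(n=2000, beta0=3.0, beta1=1.5, sigma=2.0)
print(f"mean deviation = {mean(eps):.3f}, SD of deviations = {stdev(eps):.3f}")
```

In real data the deviations are not observed; only x and y are, which is why the parameters have to be estimated.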
35 Simple linear regression model: The model involves parameters that are unknown (β0 and β1) but can be estimated from sample data. The error term ϵi (epsilon) is also unobservable but can be estimated from sample data through the residuals. Regression coefficients are values that represent the effect of the individual independent variable (X) on the dependent variable (Y). R² is the coefficient of determination and gives the proportion of variation in the dependent variable that is explained by variation in the independent variable. β0 is the intercept, the value of Y when X = 0. β1 is the slope of the regression, the increase or decrease in Y for each one-unit change in X.
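The estimation described on this slide can be sketched with ordinary least squares, the standard method for SLR (the slides do not name the estimator, so this is an assumption). The Python sketch computes the slope as Sxy/Sxx, the intercept as the mean of y minus the slope times the mean of x, the residuals, and R² as 1 − SSE/SST. The data are made up and fit_slr is a hypothetical name.

```python
from statistics import mean


def fit_slr(xs, ys):
    """Least-squares intercept b0, slope b1, R-squared, and residuals."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, n >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sst = sum((y - y_bar) ** 2 for y in ys)  # total variation in y
    if sxx == 0 or sst == 0:
        raise ValueError("x and y must each vary")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                # estimated slope
    b0 = y_bar - b1 * x_bar       # estimated intercept
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # DATA - FIT
    sse = sum(e * e for e in residuals)  # unexplained variation
    r_squared = 1.0 - sse / sst
    return b0, b1, r_squared, residuals


# Made-up illustrative data, not taken from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, r_squared, residuals = fit_slr(x, y)
print(f"y-hat = {b0:.2f} + {b1:.2f} x, R^2 = {r_squared:.4f}")
```

The residuals from a least-squares fit with an intercept sum to zero, and in simple linear regression R² equals the square of the Pearson correlation between x and y.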
36 SLR: F test and t test: The F test is used as a general indicator of the probability that the predictor variable contributes to the variance in the dependent variable. The null hypothesis is that the predictor weight (coefficient) is zero. The t test is used to test the significance of the predictor in the equation. The null hypothesis is that the predictor (independent variable) does not contribute to the variance in the dependent variable.
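With a single predictor the two tests are equivalent: the F statistic equals the square of the t statistic for the slope. The Python sketch below computes both from made-up data under the usual least-squares formulas, t = b1/SE(b1) with SE(b1) = sqrt(MSE/Sxx), and F = MSR/MSE. The function name is hypothetical, and p-values are omitted because the standard library has no t or F distribution.

```python
import math
from statistics import mean


def slope_t_and_f(xs, ys):
    """t statistic for the slope and the ANOVA F statistic in SLR."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two samples of equal length, n >= 3")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # residual SS
    ssr = sum((f - y_bar) ** 2 for f in fitted)          # regression SS
    if sse == 0:
        raise ValueError("perfect fit: test statistics are undefined")
    mse = sse / (n - 2)             # residual mean square, n - 2 df
    se_b1 = math.sqrt(mse / sxx)    # standard error of the slope
    t_stat = b1 / se_b1
    f_stat = (ssr / 1) / mse        # regression has 1 df in SLR
    return t_stat, f_stat, n - 2


# Made-up illustrative data, not taken from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7]
t_stat, f_stat, df_resid = slope_t_and_f(x, y)
print(f"t = {t_stat:.2f}, F(1, {df_resid}) = {f_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
```

This is why regression output for one predictor shows the same p-value for the overall F test and for the slope's t test when conventional (non-robust) standard errors are used.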
37 Vignette – Hypothetical Data: Suppose you are interested in predicting the weight (gm) in pericentrin-positive dwarfism based on gestational age (wks). Is the correlation coefficient an appropriate test for this project? If not, select an appropriate test statistic, present the regression equation, and interpret your result. Test the fitness of the model and explain the coefficient of determination.
40 Normality Test: Is wt (gm) normally distributed? Is gestational age (wks) normally distributed? Stata commands: swilk gm_wt and swilk gestationalageinweeks (Shapiro-Wilk W test for normal data; columns: Variable, Obs, W, V, z, Prob>z), then sktest gm_wt gestationalageinweeks (Skewness/Kurtosis tests for Normality; columns: Variable, Obs, Pr(Skewness), Pr(Kurtosis), adj chi2(2), Prob>chi2).
41 Regression (Output) & Equation: regress gm_wt gestationalageinweeks if n_catgesta==1, vce(robust) (linear regression with robust standard errors; the output reports Number of obs, F(1, 76), Prob > F, R-squared, Root MSE, and, for gestationalageinweeks and _cons, Coef., Std. Err., t, P>|t| and the 95% Conf. Interval). Equation: WEIGHT = ... grams (GESTATIONAL AGE in WEEKS)
42 Regression Line, Equation, R square: What is R square? Interpret the regression equation.
43 Vignette: In children with CP who underwent spinal fusion for correction of curve deformities, can the postoperative Cobb angle be used to predict their length of hospitalization? What is the regression equation? Please interpret your result.
52 Result Interpretation: The result from SLR states the direction, strength, value, degrees of freedom and significance level. Note that if the ANOVA is not significant, the section of the output labeled sig will be > 0.05, implying that the regression equation is not significant. Statement of result: A simple linear regression was computed predicting CP children’s length of hospital stay following spinal fusion based on their postoperative Cobb angle. The regression equation was not significant (F(1, 62) = 0.18, p = 0.67), with an R square of ... Therefore, postoperative Cobb angle cannot be used to predict the length of hospitalization following spinal fusion in CP children with scoliosis.
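The F statistic and R square in a result statement are linked: for a regression with df1 numerator and df2 denominator degrees of freedom under conventional (non-robust) standard errors, R² = df1·F / (df1·F + df2). The Python sketch below applies that identity to the rounded F(1, 62) = 0.18 quoted above. The outcome is only an approximation implied by the rounded F, not the value reported in the original output, and the function name is hypothetical.

```python
def r_squared_from_f(f_stat, df1, df2):
    """R-squared implied by an overall F statistic and its degrees of freedom."""
    if f_stat < 0 or df1 <= 0 or df2 <= 0:
        raise ValueError("F must be non-negative and df must be positive")
    return df1 * f_stat / (df1 * f_stat + df2)


# Rounded F statistic and degrees of freedom from the result statement.
implied_r2 = r_squared_from_f(0.18, 1, 62)
print(f"implied R^2 is about {implied_r2:.4f}")
```

An implied R² this close to zero is consistent with the conclusion that the predictor explains almost none of the variation in length of stay.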