Correlation and Regression

Correlation and Regression Davina Bristow & Angela Quayle

Topics Covered:
- Is there a relationship between x and y?
- What is the strength of this relationship? Pearson's r
- Can we describe this relationship and use this to predict y from x? Regression
- Is the relationship we have described statistically significant? t test
- Relevance to SPM: GLM

The relationship between x and y Correlation: is there a relationship between 2 variables? Regression: how well does a certain independent variable predict the dependent variable? CORRELATION ≠ CAUSATION. In order to infer causality: manipulate the independent variable and observe the effect on the dependent variable.

Scattergrams [Figure: scatter plots of Y against X illustrating positive correlation, negative correlation, and no correlation]

Variance vs Covariance First, a note on your sample: if you want to treat your sample as representative of the general population (RANDOM EFFECTS MODEL), use the degrees of freedom (n – 1) in your calculations of variance or covariance. But if you only want to describe your current sample (FIXED EFFECTS MODEL), substitute n for the degrees of freedom.

Variance vs Covariance Do two variables change together? Variance: Gives information on variability of a single variable. Covariance: Gives information on the degree to which two variables vary together. Note how similar the covariance is to variance: the equation simply multiplies x’s error scores by y’s error scores as opposed to squaring x’s error scores.
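
The variance and covariance equations on this slide are images in the original and do not survive in the transcript; for reference, the standard sample formulas (using the n – 1 degrees of freedom discussed above) are:

```latex
s_x^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}
\qquad\qquad
\operatorname{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
```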

Covariance When X and Y increase together: cov(x,y) is positive. When X increases as Y decreases: cov(x,y) is negative. When there is no consistent relationship: cov(x,y) = 0.

Example Covariance

x    y    xᵢ – x̄    yᵢ – ȳ    (xᵢ – x̄)(yᵢ – ȳ)
0    3      –3        0            0
2    2      –1       –1            1
3    4       0        1            0
4    0       1       –3           –3
6    6       3        3            9
x̄ = 3   ȳ = 3                 Σ = 7

What does this number tell us?
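
As a quick check, a minimal Python sketch (assuming NumPy is available; the arrays are the x and y columns of the table above) reproduces the sum of products and the resulting covariance:

```python
import numpy as np

# x and y columns from the table above
x = np.array([0, 2, 3, 4, 6])
y = np.array([3, 2, 4, 0, 6])

dx = x - x.mean()          # deviations from the mean (x̄ = 3)
dy = y - y.mean()          # deviations from the mean (ȳ = 3)

print((dx * dy).sum())                  # 7.0 — the sum in the table
print((dx * dy).sum() / (len(x) - 1))   # 1.75 — covariance using n - 1
print(np.cov(x, y, ddof=1)[0, 1])       # 1.75 — same value from NumPy
```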

Problem with Covariance: The value of the covariance depends on the standard deviations of the data: if they are large, the covariance will be larger than if they are small, even if the relationship between x and y is exactly the same in the large versus small standard deviation datasets.

Example of how the covariance value relies on variance

                 High variance data                  Low variance data
Subject     x      y     x error × y error      x      y     x error × y error
1          101    100        2500               54     53          9
2           81     80         900               53     52          4
3           61     60         100               52     51          1
4           51     50           0               51     50          0
5           41     40         100               50     49          1
6           21     20         900               49     48          4
7            1      0        2500               48     47          9
Mean        51     50                           51     50
Sum of x error × y error:    7000                                  28
Covariance:                  1166.67                               4.67

Solution: Pearson's r The raw covariance value does not really tell us much on its own, because it depends on the scale of the data. Solution: standardise this measure. Pearson's r standardises the covariance value: it divides the covariance by the product of the standard deviations of X and Y: r = cov(x,y) / (sx sy)

Pearson’s R continued
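
The equations on this slide are images in the original; written out in full, the standard expansion is:

```latex
r = \frac{\operatorname{cov}(x, y)}{s_x\, s_y}
  = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x\, s_y}
```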

Limitations of r When r = 1 or r = -1: we can predict y from x with certainty; all data points lie on a straight line y = ax + b. r is actually an estimate, r̂: ρ = true r of the whole population, r̂ = estimate of r based on the sample data. r is very sensitive to extreme values: a single outlier can drastically change the value of r.
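
To illustrate the last point, a small sketch with made-up numbers (assuming NumPy) shows how a single extreme value can change r dramatically:

```python
import numpy as np

# Five made-up points with a near-perfect linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(np.corrcoef(x, y)[0, 1])      # r is close to +1

# Add a single extreme point and recompute
x2 = np.append(x, 6.0)
y2 = np.append(y, -20.0)
print(np.corrcoef(x2, y2)[0, 1])    # r changes drastically (it is negative here)
```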

Regression Correlation tells you if there is an association between x and y but it doesn’t describe the relationship or allow you to predict one variable from the other. To do this we need REGRESSION!

Best-fit Line The aim of linear regression is to fit a straight line, ŷ = ax + b, to the data that gives the best prediction of y for any value of x. This will be the line that minimises the distance between the data and the fitted line, i.e. the residuals ε. In ŷ = ax + b: a = slope, b = intercept, ŷ = predicted value, yᵢ = true value, ε = residual error (the difference between yᵢ and ŷ).

Least Squares Regression To find the best line we must minimise the sum of the squares of the residuals (the vertical distances from the data points to our line). Model line: ŷ = ax + b (a = slope, b = intercept). Residual (ε) = y – ŷ. Sum of squares of residuals = Σ(y – ŷ)². We must find the values of a and b that minimise Σ(y – ŷ)².

Finding b First we find the value of b that gives the minimum sum of squares. Trying different values of b is equivalent to shifting the line up and down the scatter plot.

Finding a Now we find the value of a that gives the minimum sum of squares. Trying out different values of a is equivalent to changing the slope of the line, while b stays constant.

Minimising sums of squares We need to minimise Σ(y – ŷ)². Since ŷ = ax + b, we need to minimise Σ(y – ax – b)². If we plot the sum of squares S against different values of a and b we get a parabola, because it is a squared term. So the minimum sum of squares is at the bottom of the curve, where the gradient is zero.

The maths bit The minimum sum of squares is at the bottom of the curve, where the gradient = 0. So we can find the a and b that give the minimum sum of squares by taking partial derivatives of Σ(y – ax – b)² with respect to a and b separately. Setting these derivatives to zero and solving gives us the values of a and b that give the minimum sum of squares.
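
The derivation itself is not shown on the slide; sketched out, the two conditions (standard least-squares algebra) are:

```latex
\frac{\partial}{\partial a}\sum_i (y_i - a x_i - b)^2 = -2\sum_i x_i\,(y_i - a x_i - b) = 0
\qquad
\frac{\partial}{\partial b}\sum_i (y_i - a x_i - b)^2 = -2\sum_i (y_i - a x_i - b) = 0
```

Solving this pair of simultaneous equations for a and b gives the solution on the next slide.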

The solution Doing this gives the following equation for a: a = r·(sy / sx), where r = correlation coefficient of x and y, sy = standard deviation of y, sx = standard deviation of x. From this you can see that: a low correlation coefficient gives a flatter slope (small value of a); a large spread of y, i.e. a high standard deviation, results in a steeper slope (high value of a); a large spread of x, i.e. a high standard deviation, results in a flatter slope (small value of a).

The solution cont. Our model equation is ŷ = ax + b. This line must pass through the mean, so ȳ = a·x̄ + b, which gives b = ȳ – a·x̄. We can put our equation for a into this, giving: b = ȳ – r·(sy / sx)·x̄, where r = correlation coefficient of x and y, sy = standard deviation of y, sx = standard deviation of x. The smaller the correlation, the closer the intercept is to the mean of y. (If people want proof that the solutions for a and b are correct I can go through it now or later, but it's NOT essential in terms of understanding what regression is doing.)

Back to the model Substituting a and b into the model gives ŷ = ax + b = r·(sy / sx)·x + ȳ – r·(sy / sx)·x̄, which rearranges to: ŷ = r·(sy / sx)·(x – x̄) + ȳ. If the correlation is zero, we will simply predict the mean of y for every value of x, and our regression line is just a flat horizontal line at the height ȳ. But this isn't very useful. We can calculate the regression line for any data, but the important question is how well does this line fit the data, or how good is it at predicting y from x?
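
A minimal Python sketch of these formulas (assuming NumPy; the small dataset is the covariance example from earlier, and the variable names are mine), with a check against NumPy's own least-squares fit:

```python
import numpy as np

# Small illustrative dataset (the covariance example from earlier)
x = np.array([0.0, 2.0, 3.0, 4.0, 6.0])
y = np.array([3.0, 2.0, 4.0, 0.0, 6.0])

r = np.corrcoef(x, y)[0, 1]
a = r * y.std(ddof=1) / x.std(ddof=1)   # slope: a = r * sy / sx
b = y.mean() - a * x.mean()             # intercept: the line passes through (x̄, ȳ)
print(a, b)                             # 0.35, 1.95

# Same answer from NumPy's built-in least-squares fit of a degree-1 polynomial
print(np.polyfit(x, y, 1))              # [0.35, 1.95]
```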

How good is our model? Total variance of y: sy² = Σ(y – ȳ)² / (n – 1) = SSy / dfy. Variance of predicted y values (ŷ): sŷ² = Σ(ŷ – ȳ)² / (n – 1) = SSpred / dfŷ. This is the variance explained by our regression model. Error variance: serror² = Σ(y – ŷ)² / (n – 2) = SSer / dfer. This is the variance of the error between our predicted y values and the actual y values, and thus is the variance in y that is NOT explained by the regression model.

How good is our model cont. Total variance = predicted variance + error variance: sy² = sŷ² + ser². Conveniently, via some complicated rearranging, sŷ² = r²·sy², so r² = sŷ² / sy². So r² is the proportion of the variance in y that is explained by our regression model.
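
A quick numerical check (assuming NumPy; continuing the same small dataset as above): the partition is exact at the level of sums of squares, and the ratio of explained to total sum of squares equals r²:

```python
import numpy as np

x = np.array([0.0, 2.0, 3.0, 4.0, 6.0])
y = np.array([3.0, 2.0, 4.0, 0.0, 6.0])

a, b = np.polyfit(x, y, 1)
y_hat = a * x + b                            # predicted values ŷ

ss_y    = ((y - y.mean()) ** 2).sum()        # total sum of squares, SSy
ss_pred = ((y_hat - y.mean()) ** 2).sum()    # sum of squares explained by the model, SSpred
ss_err  = ((y - y_hat) ** 2).sum()           # residual sum of squares, SSer

print(ss_y, ss_pred + ss_err)                        # equal: SSy = SSpred + SSer
print(ss_pred / ss_y, np.corrcoef(x, y)[0, 1] ** 2)  # both equal r²
```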

How good is our model cont. Insert r²·sy² into sy² = sŷ² + ser² and rearrange to get: ser² = sy² – r²·sy² = sy²(1 – r²). From this we can see that the greater the correlation, the smaller the error variance, so the better our prediction.

Is the model significant? i.e. do we get a significantly better prediction of y from our regression equation than by just predicting the mean? F-statistic: via some complicated rearranging, F(dfŷ, dfer) = sŷ² / ser² = … = r²(n – 2) / (1 – r²). And it follows (because F = t²) that: t(n–2) = r·√(n – 2) / √(1 – r²). So all we need to know are r and n.
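
A hedged sketch (assuming NumPy and SciPy are available; the r and n values are illustrative) of turning r and n into the t statistic and a two-tailed p-value:

```python
import numpy as np
from scipy import stats

r, n = 0.35, 5                               # illustrative values

t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(abs(t), df=n - 2)         # two-tailed p-value from the t distribution
print(t, p)

# scipy.stats.pearsonr(x, y) returns r and this same two-tailed p-value directly
```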

General Linear Model Linear regression is actually a form of the General Linear Model where the parameters are a, the slope of the line, and b, the intercept: y = ax + b + ε. A General Linear Model is just any model that describes the data in terms of a straight line.

Multiple regression Multiple regression is used to determine the effect of a number of independent variables, x1, x2, x3 etc., on a single dependent variable, y. The different x variables are combined in a linear way and each has its own regression coefficient: y = a1x1 + a2x2 + … + anxn + b + ε. The a parameters reflect the independent contribution of each independent variable, x, to the value of the dependent variable, y, i.e. the amount of variance in y that is accounted for by each x variable after all the other x variables have been accounted for.
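
A minimal sketch of this (assuming NumPy; the data and variable names are made up for illustration) fits the coefficients by least squares using a design matrix with a column of ones for the intercept:

```python
import numpy as np

# Made-up data: two independent variables and one dependent variable
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

# Design matrix: one column per predictor plus a column of ones for the intercept b
X = np.column_stack([x1, x2, np.ones_like(x1)])

coeffs, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
a1, a2, b = coeffs
print(a1, a2, b)
```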

SPM Linear regression is a GLM that models the effect of one independent variable, x, on ONE dependent variable, y Multiple Regression models the effect of several independent variables, x1, x2 etc, on ONE dependent variable, y Both are types of General Linear Model GLM can also allow you to analyse the effects of several independent x variables on several dependent variables, y1, y2, y3 etc, in a linear combination This is what SPM does and all will be explained next week!