# Correlation and regression


Correlation and regression

Introduction
Scientific rules and principles are often expressed mathematically. There are two main approaches to finding a mathematical relationship between variables:
- Analytical: based on theory
- Empirical: based on observation and experience

The straight line (1)
Most graphs based on numerical data are curves; the straight line is a special case. Data is often manipulated to yield straight-line graphs because the straight line is relatively easy to analyse.

The straight line (2)
Straight line equation: $y = mx + c$
Slope $= m = \Delta y / \Delta x$
Intercept $= c$

Correlation & Regression
These are statistical processes which:
- suggest the existence of a relationship
- determine the best equation to fit the data

Correlation is a measure of the strength of a relationship between two variables. Regression is the process of determining that relationship.

Correlation and Regression
The next few slides illustrate correlation and regression

No Correlation

Positive correlation

Negative correlation

Curvilinear correlation

Correlation coefficient
A statistical measure of the strength of a relationship between two variables.
- Pearson's product-moment correlation coefficient, $r$
- Spearman's rank correlation coefficient, $\rho$

Both take a value in the range $-1.0$ to $+1.0$:
- $r$ or $\rho = +1.0$ represents a perfect positive correlation
- $r$ or $\rho = -1.0$ represents a perfect negative correlation
- $r$ or $\rho = 0.0$ represents no correlation

Values of $r$ or $\rho$ are associated with a probability of there being a relationship: for a given sample size, the larger the magnitude, the less likely it is to have arisen by chance.
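As a rough illustration of the two coefficients, the sketch below computes Pearson's coefficient from its usual sums-of-products form and Spearman's coefficient as Pearson's coefficient applied to the ranks. The function names and the data are made up for this example and are not from the slides.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's coefficient of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Illustrative data only: y rises steadily with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

For data like these, the rank coefficient reaches 1 because the ordering of the points is perfectly preserved, while Pearson's coefficient stays a little below 1 because the relationship is curved.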

Linear regression
Linear regression is the process of fitting the best straight line to a set of data. The usual method minimises the sum of the squares of the errors between the data and the predicted line. For this reason, it is called “the method of least squares”.
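A minimal sketch of the least-squares calculation for a line $y = a + bx$, using the standard closed-form result (slope = sum of cross-products divided by sum of squares of $x$ about its mean). The function name `fit_line` and the data are assumptions made for illustration.

```python
def fit_line(x, y):
    """Least-squares straight line y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through the means
    return a, b

# Made-up measurements that lie close to a straight line.
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 7.1, 8.8]
a, b = fit_line(x, y)
print(f"y = {a:.2f} + {b:.2f} x")
```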

Linear regression - assumptions
- The error in the independent ($x$) variable is negligible relative to the error in the dependent ($y$) variable.
- The errors are normally, independently and identically distributed with mean 0 and constant variance: NIID$(0, \sigma^2)$.

Linear regression model
For a set of data, $(x, y)$, there is an equation that best fits the data, of the form $y = a + bx + \varepsilon$, with fitted value $Y = a + bx$.
- $x$ is the independent variable, or predictor
- $y$ is the measured value of the dependent (predicted) variable
- $Y$ is the calculated value of the dependent (predicted) variable
- $\varepsilon$ is the error term and accounts for that part of $y$ not “explained” by $x$

For any individual data point, $i$, the difference between the observed and predicted value of $y$ is called the residual, $r_i$, i.e. $r_i = y_i - Y_i = y_i - (a + bx_i)$. The residuals provide a measure of the error term.
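The residual definition can be sketched directly in code. The coefficients and data below are assumed values chosen for illustration; they are not the result of a fit.

```python
def residuals(x, y, a, b):
    """Residuals r_i = y_i - (a + b*x_i) for a line with intercept a, slope b."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Made-up observations and an assumed line Y = 1.0 + 2.0 x.
x = [0, 1, 2, 3]
y = [1.0, 3.5, 4.5, 7.0]
a, b = 1.0, 2.0
r = residuals(x, y, a, b)
print(r)
```

A positive residual means the observed point lies above the line, a negative one below it.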

Regression analysis (1)
Check the correlation coefficient.
- Null hypothesis $H_0$: there is no correlation between $x$ and $y$
- Alternative hypothesis $H_1$: there is a correlation between $x$ and $y$
- Decision rule: reject $H_0$ if $|r| \ge$ critical value at $\alpha = 0.05$

If you cannot reject $H_0$ then proceed no further; otherwise carry out a full regression.
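Critical values of $r$ are normally read from a table. An equivalent route for Pearson's $r$, shown here as a sketch and not as the method the slides prescribe, converts $r$ into a $t$ statistic with $n - 2$ degrees of freedom. The sample size and the value of $r$ in the worked line are invented for illustration.

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad \text{degrees of freedom} = n - 2

% Illustrative numbers: n = 10, r = 0.70
t = \frac{0.70\sqrt{8}}{\sqrt{1-0.49}} \approx \frac{1.980}{0.714} \approx 2.77
```

With 8 degrees of freedom the two-tailed critical value of $t$ at $\alpha = 0.05$ is 2.306, so in this made-up case $H_0$ would be rejected.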

Regression analysis (2)
Regression analysis can be carried out using either Excel or Minitab. Excel will need the Analysis ToolPak add-in installed. The output from both Minitab and Excel will give the following information:
- the regression equation (in the form $y = a + bx$)
- probabilities that $a \ne 0$ and $b \ne 0$
- the coefficient of determination, $R^2$
- analysis of variance

In addition you will need to produce at least one of:
- residuals vs. fitted values
- residuals vs. x values
- residuals vs. y values

Interpreting output
- Regression equation: this is the equation that best fits the data and provides the predicted values of $y$.
- Analysis of variance: determines what proportion of the variation in $y$ can be accounted for by the regression equation and what proportion is accounted for by the error term. The p-value arising out of this indicates whether the regression equation explains a significant part of that variation.
- The proportion of the variation in the data accounted for by the regression equation is called the coefficient of determination, $R^2$; for a straight-line fit it is equal to the square of the correlation coefficient.
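To make the coefficient of determination concrete, the sketch below computes it as one minus the ratio of the residual sum of squares to the total sum of squares. The data and the line $Y = 1.15 + 1.9x$ (the least-squares line for these invented points) are assumptions for illustration.

```python
def r_squared(y, fitted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1 - ss_residual / ss_total

# Made-up data and the fitted values from the assumed line Y = 1.15 + 1.9 x.
x = [0, 1, 2, 3]
y = [1.0, 3.5, 4.5, 7.0]
fitted = [1.15 + 1.9 * xi for xi in x]
r2 = r_squared(y, fitted)
print(f"R^2 = {r2:.3f}")
```

A value close to 1 means that most of the variation in $y$ is accounted for by the line; the remainder is attributed to the error term.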

Output plots
The output plots are used to check the assumptions about the errors. The normal probability plot should show the residuals lying on a straight line. The residual plots should have no obvious pattern and should not show the residuals increasing or decreasing as the fitted or measured values increase.

Non linear relationships
Many functions can be manipulated mathematically to yield a straight line equation. Some examples are given in the next few slides

Linearisation (2)

Linearisation (3)

Functions involving logs (1)
Some functions can be linearised by taking logs. These are $y = Ax^n$ and $y = Ae^{kx}$.

Functions involving logs (2)
For $y = Ax^n$, taking logs gives $\log y = \log A + n \log x$. A graph of $\log y$ vs. $\log x$ gives a straight line with slope $n$ and intercept $\log A$. To find $A$ you must take antilogs ($= 10^x$).
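The log-log procedure can be sketched as follows: take base-10 logs of both variables, fit a straight line, then take the antilog of the intercept. The helper names and the data (generated exactly from $A = 3$, $n = 2$) are assumptions for illustration.

```python
from math import log10

def fit_line(x, y):
    """Least-squares straight line y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fit_power_law(x, y):
    """Fit y = A * x**n via a straight-line fit of log10(y) against log10(x)."""
    log_x = [log10(v) for v in x]
    log_y = [log10(v) for v in y]
    intercept, slope = fit_line(log_x, log_y)
    return 10 ** intercept, slope  # A is the antilog of the intercept; n is the slope

# Made-up data generated from y = 3 x^2 (all values must be positive to take logs).
x = [1, 2, 4, 8]
y = [3 * v ** 2 for v in x]
A, n = fit_power_law(x, y)
print(f"A = {A:.3f}, n = {n:.3f}")
```

Note that fitting in log space minimises the squared errors in $\log y$, not in $y$ itself, so with noisy data the result can differ slightly from a direct fit.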

Functions involving logs (3)
For $y = Ae^{kx}$, we use natural logs: $\ln y = \ln A + kx$. A graph of $\ln y$ vs. $x$ gives a straight line with slope $k$ and intercept $\ln A$. To find $A$ we must take antilogs ($= e^x$).
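A comparable sketch for the exponential case: only $y$ is transformed, using natural logs, and $A$ is recovered with the exponential function. The helper names and the data (generated exactly from $A = 2$, $k = 0.5$) are assumptions for illustration.

```python
from math import exp, log

def fit_line(x, y):
    """Least-squares straight line y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fit_exponential(x, y):
    """Fit y = A * exp(k*x) via a straight-line fit of ln(y) against x."""
    ln_y = [log(v) for v in y]
    intercept, slope = fit_line(x, ln_y)
    return exp(intercept), slope  # A = e**intercept; k is the slope

# Made-up data generated from y = 2 e^(0.5 x) (y must be positive to take logs).
x = [0, 1, 2, 3]
y = [2 * exp(0.5 * v) for v in x]
A, k = fit_exponential(x, y)
print(f"A = {A:.3f}, k = {k:.3f}")
```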

Polynomials
These are functions of general formula $y = a + bx + cx^2 + dx^3 + \dots$. They cannot be linearised, but techniques for fitting polynomials exist. Both Excel and Minitab provide for fitting polynomials to data.
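One such technique is least squares again: although a polynomial is not a straight line in $x$, it is linear in its coefficients, so the same principle applies. The sketch below builds the normal equations and solves them by Gaussian elimination. It is a plain illustration with made-up data, not what Excel or Minitab do internally, and a library routine would be more numerically robust for real work.

```python
def fit_polynomial(x, y, degree):
    """Least-squares coefficients [a, b, c, ...] of y = a + b*x + c*x**2 + ..."""
    m = degree + 1
    # Normal equations as an augmented matrix:
    # sum_k coef[k] * sum(x**(j+k)) = sum(y * x**j) for j = 0 .. degree.
    rows = [
        [sum(xi ** (j + k) for xi in x) for k in range(m)]
        + [sum(yi * xi ** j for xi, yi in zip(x, y))]
        for j in range(m)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, m):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, m + 1):
                rows[r][c] -= factor * rows[col][c]
    # Back substitution, from the highest-order coefficient down.
    coef = [0.0] * m
    for j in reversed(range(m)):
        known = sum(rows[j][k] * coef[k] for k in range(j + 1, m))
        coef[j] = (rows[j][m] - known) / rows[j][j]
    return coef

# Made-up data generated from y = 1 + 2x + 3x^2.
x = [-2, -1, 0, 1, 2]
y = [1 + 2 * v + 3 * v ** 2 for v in x]
coef = fit_polynomial(x, y, 2)
print(coef)
```

The coefficients are returned lowest power first, matching the order $a, b, c, \dots$ in the general formula.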