1 Correlation and Simple Regression
2 Introduction Interested in the relationships between variables. What will happen to one variable if another is changed? To what extent is it the case that increases in the interest rate reduce inflation? Might want to know how sensitive the relationship is, and if possible, what form it takes. Models needed.
3 Koops Deforestation Data Y – average annual forest loss, as % of total forested area. X – number of people per 1000 hectares. Data on 70 tropical countries (N=70).
4 Figure 1.1 Deforestation/Population Density Data with Line of Best Fit
5 Predicted Value of Forest Loss Given Population Density
6 X=2000 implies Y=2.3 If there are 2000 people per 1000 hectares, forest loss would be about 2.3%. Comments i) Increased dispersion about the line as X increases; more uncertainty about predictions for higher population densities. ii) Ignores other impacts on deforestation.
7 Correlation Objectives of Correlation To measure how close the relationship between two variables is to linearity – strength of linear association. Capture the sign of the relationship. Determine on a common scale for all cases: -1 to +1. The closer to zero, the weaker the correlation.
8 Sample Covariance X and Y vary about their mean values. To what extent is this variation aligned?
9 Scatter Plot of Forest Loss Against Population Density: Axes Crossing at Mean Points
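10 Deviations from Mean [Figure: regions of the scatter plot labelled by whether the deviations of X and Y from their means have the same sign or opposite sign]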
11 Sample Covariance Formula Problem: varies with the scale of the data
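The formula itself did not survive in this transcript. For reference, the standard definition of the sample covariance is sketched below; the symbol s_XY and the N-1 divisor are assumptions (some texts divide by N instead).

```latex
s_{XY} = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y}),
\qquad
\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i,
\quad
\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i
```

Each term is positive when the two deviations have the same sign and negative when they have opposite signs. Because the deviations are measured in the units of X and Y, the value changes whenever either variable is rescaled.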
12 Sample Correlation I Standardise using sample standard deviations Sample variance: Sample standard deviation:
13 Sample Correlation II
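The formulas on these two slides are missing from the transcript. A sketch of the standard definitions follows, assuming the N-1 divisor throughout and writing s_XY for the sample covariance of X and Y (both are notational assumptions, not confirmed by the slides).

```latex
s_X^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})^2,
\qquad
s_X = \sqrt{s_X^2}
\quad \text{(and likewise } s_Y^2,\ s_Y \text{ for } Y\text{)}

r = \frac{s_{XY}}{s_X\, s_Y}
  = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{N}(X_i - \bar{X})^2 \,\sum_{i=1}^{N}(Y_i - \bar{Y})^2}}
```

The divisors cancel in the ratio, so r is the same whether N or N-1 is used, and it always lies between -1 and +1.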
14 Calculations for Deforestation Data
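The deforestation figures are not reproduced in this transcript, so the sketch below runs the same kind of calculation on a handful of made-up numbers. The data values, function names, and the N-1 divisor are all illustrative assumptions. It also checks the point made earlier: rescaling X changes the covariance but leaves the correlation unchanged.

```python
import math

def sample_covariance(x, y):
    """Sample covariance using the N-1 divisor (an assumption)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Covariance standardised by the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))  # cov(x, x) is the sample variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Illustrative numbers only, NOT the Koops deforestation data.
density = [100.0, 400.0, 900.0, 1500.0, 2200.0]   # people per 1000 hectares
forest_loss = [0.3, 0.6, 1.1, 1.4, 2.6]           # % of forested area per year

cov_original = sample_covariance(density, forest_loss)
r_original = sample_correlation(density, forest_loss)

# Express density in people per hectare instead: X is divided by 1000.
density_per_hectare = [d / 1000.0 for d in density]
cov_rescaled = sample_covariance(density_per_hectare, forest_loss)
r_rescaled = sample_correlation(density_per_hectare, forest_loss)

print("covariance:", cov_original, "after rescaling:", cov_rescaled)
print("correlation:", r_original, "after rescaling:", r_rescaled)
```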
15 Correlation and Causality Must distinguish between causality and correlation. Correlation does not imply causality. A correlation does not even indicate which way the causality should run (from X to Y or the other way round). Two trending time series variables may be spuriously correlated. Causality is judgmental.
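The point about trending series can be illustrated with simulated data. The sketch below is an assumption-laden illustration, not something from the slides: it builds two series that share nothing except an upward trend, then compares the correlation of their levels with the correlation of their period-to-period changes. The trend slopes, noise sizes, and seed are arbitrary choices.

```python
import math
import random

def correlation(x, y):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable
periods = range(50)

# Two simulated series with independent noise; only the upward trend is shared.
series_a = [2.0 * t + random.gauss(0, 3) for t in periods]
series_b = [0.5 * t + random.gauss(0, 1) for t in periods]

r_levels = correlation(series_a, series_b)

# Period-to-period changes remove the common trend.
changes_a = [later - earlier for earlier, later in zip(series_a, series_a[1:])]
changes_b = [later - earlier for earlier, later in zip(series_b, series_b[1:])]
r_changes = correlation(changes_a, changes_b)

print("correlation of levels:", r_levels)
print("correlation of changes:", r_changes)
```

The levels are highly correlated even though neither series influences the other, while the changes show a much weaker association.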
16 Example: UK Aggregate Consumption and Income Aggregate UK consumption and income over a period of years are highly correlated. Economists believe there is a relationship between these two variables. Take correlation to be evidence in favour of the existence of a causal relationship: income causes consumption.
17 Time Series Plot of UK Aggregate Consumption and Income
18 Scatter Plot of UK Aggregate Consumption Against Income
19 Another Example Ratio of unemployment benefit to wages, X, and the unemployment rate, Y. Annual observations for the UK. Theory: X causes Y. Policy implication: r>0 implies cut benefits relative to wages to reduce unemployment.
20 Scatter Plot of Unemployment Against Benefit/Wage Ratio What happens to r if the following observation is not included? r =
21 Final Comments Correlation measures linear association on the scale [-1,+1]. r = -1 or r = +1 indicates PERFECT linear correlation (exact straight line). Only concerned with the relationship between TWO variables (bivariate). This measure is sensitive to outliers. Correlation may be taken as supportive evidence of a causal relationship, but correlation does not imply causality.
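The sensitivity to outliers can be seen by computing r with and without a single extreme observation. The sketch below uses hypothetical numbers chosen for illustration (they are not the UK unemployment data): eight points with no linear association, plus one point far from the rest.

```python
import math

def correlation(x, y):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: Y bounces around with no linear pattern in X.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 3, 6, 4, 5, 3, 6, 4]

r_without = correlation(x, y)

# Add one observation far from the main cloud of points.
r_with = correlation(x + [30], y + [20])

print("r without the extreme observation:", r_without)
print("r with the extreme observation:   ", r_with)
```

A single observation moves r from zero to above 0.9 here, which is why it is worth checking a scatter plot before reading much into a correlation.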
22 Bivariate regression Correlation can: Indicate the strength of a relationship It cannot: Contribute to an understanding of how the variables may be related Make predictions about Y based on knowledge of X Regression analysis can: Examine the nature of the relationship between X and Y Make predictions from that.
23 Figure 2.1 Deforestation/Population Density Data with Line of Best Fit
24 Introduction What is the line of best fit? How can it be defined? What does it mean? Can place line by eye, but non-systematic.
25 UK consumption-income scatter plot gives a very strong indication of a linear relationship.
26 UK unemployment-benefit to wage ratio plot does not look linear.
27 Models Simplest model: straight line Too constrained – will never hold exactly. Allow for disturbances for each case, i=1,2,…,N Properties of disturbances: on average zero, but they vary. They have: mean zero, and variance denoted:
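The equations on this slide are missing from the transcript. A sketch of the usual way to write this model is given below; the symbols for the intercept, slope, disturbance, and variance are assumed notation and may differ from the original slides.

```latex
% Exact straight line (too constrained to hold for real data):
Y = \alpha + \beta X

% Allowing a disturbance for each case:
Y_i = \alpha + \beta X_i + \varepsilon_i, \qquad i = 1, 2, \ldots, N

% Assumed properties of the disturbances:
E(\varepsilon_i) = 0, \qquad \operatorname{Var}(\varepsilon_i) = \sigma^2
```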
30 So what? We have a theory that allows us to think of there being an underlying linear relationship, but one that isn't exact. This fits with what we observe. It leads to a statistical theory of errors, the real life equivalent of the theoretical disturbances, that eventually allows testing of various sorts.
31 Least Squares Line: Bivariate Linear Regression Want the BEST LINEAR description of the way Y depends on X: deforestation on population density, or consumption on income, or unemployment on the benefit to wage ratio. Geometrically, we want the best fitting straight line to the data presented on a scatter plot. "Best" needs to be defined.
32 [Figure: two candidate lines through the scatter plot. One leaves lots of big errors, e_i; for the other the errors are smaller.]
33 Want to calculate the best values of the intercept and slope in the model, for i=1,2,...,N.
34 Equation of the fitted line – note that subscripts are not used here: Predicted (fitted) value of Y_i given X_i
35 [Figure: the data point (X_i, Y_i) marked on the scatter plot, with X_i and Y_i indicated on the axes]
36 The Error Also called the RESIDUAL. There are N of these, one for each i=1,2,…,N
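The defining equation is missing from the transcript. In the usual notation, the error for case i is the vertical gap between the observed value and the fitted value on the line. The hats marking estimated quantities are an assumed convention here.

```latex
e_i = Y_i - \hat{Y}_i = Y_i - \hat{\alpha} - \hat{\beta} X_i,
\qquad i = 1, 2, \ldots, N
```

A positive e_i means the point lies above the fitted line; a negative e_i means it lies below.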
37 The Best Line Actually, a best line – others can be defined. It is the line that minimises the sum of the squares of the errors.
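As a rough sketch of what this criterion delivers, the code below uses the standard closed-form least squares solution (slope equal to the sum of cross-deviations divided by the sum of squared X deviations, intercept chosen so the line passes through the point of means). It then checks that nudging either coefficient makes the sum of squared errors larger. The data and the function names are made up for illustration.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) minimising the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through the mean point
    return intercept, slope

def sum_squared_errors(x, y, intercept, slope):
    """Sum over all cases of (observed Y minus fitted Y) squared."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Illustrative numbers only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a_hat, b_hat = least_squares_line(x, y)
best = sum_squared_errors(x, y, a_hat, b_hat)

# Any other line should do worse: try moving each coefficient a little.
nudged = [
    sum_squared_errors(x, y, a_hat + da, b_hat + db)
    for da in (-0.1, 0.1)
    for db in (-0.1, 0.1)
]

print("intercept:", a_hat, "slope:", b_hat)
print("minimised sum of squared errors:", best)
print("sums for nearby lines:", nudged)
```

Squaring stops positive and negative errors from cancelling and penalises large errors heavily. Other criteria, such as the sum of absolute errors, define different "best" lines.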