
1 CIVE2602 - Engineering Mathematics 2.2 (20 credits) Statistics and Probability Lecture 10 Correlation (r-values) Linear Regression - Dependent and independent variables - Ordinary Least Squares regression - Techniques to assess how good the regression is ©Claudio Nunez 2010, sourced from http://commons.wikimedia.org/wiki/File:2010_Chile_earthquake_- _Building_destroyed_in_Concepci%C3%B3n.jpg?uselang=en-gb Available under creative commons license

2 Correlation: example scatter plot showing positive correlation.

3 Correlation coefficient, r. An initial indication of the strength of the relationship may be given by the correlation coefficient r = Σ(x_i - x̄)(y_i - ȳ) / sqrt( Σ(x_i - x̄)² · Σ(y_i - ȳ)² ). r takes a value between -1 and +1. We often use Excel etc. to compute it.
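
A minimal sketch (not from the original slides) of computing r in Python with NumPy rather than Excel; the data values here are made up purely for illustration:

    import numpy as np

    # Hypothetical data, for illustration only.
    x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    y = np.array([1.1, 2.3, 2.9, 4.2, 4.8])

    # np.corrcoef returns the 2x2 correlation matrix;
    # the off-diagonal entry is the Pearson correlation coefficient r.
    r = np.corrcoef(x, y)[0, 1]
    print(f"r = {r:.3f}")   # close to +1 for this near-linear data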

4 Interpreting the correlation coefficient (r):
Maximum: r = +1 - perfect correlation
         r = 0  - no correlation
Minimum: r = -1 - perfect inverse correlation

5 Correlation coefficient, r. The correlation coefficient is only a guide - you need to inspect the data visually as well.

6 What "r" doesn't tell you:
- "r" reflects the STRENGTH of an association, NOT its slope: the two lines shown both have r = 1.
- "r" alone does NOT tell you whether a relationship is "statistically significant" (that also depends on the sample size).

7 Correlation vs. Regression
Correlation: a test of association - both variables are on an equal footing. "Are variables x and y related in some way?" Example: height vs. shoe size.
Regression: a test of prediction - one (independent) variable logically precedes the other (dependent). "Can I predict y if I know x?" Example: breaking strength of a beam (y) as a function of curing time (x).
The distinction is often ignored!

8 Breaking strength of beams against beam curing time: plot of breaking strength (kN), the dependent variable, against curing time (days), the independent variable. Because the observed values of X are selected, X is known as the control, or independent, variable. The second variable (Y) is called the dependent, or response, variable. The response variable may be subject to measurement/experimental error.

9 Regression analysis. Aim: to predict y from x ('regressing y on x').

10 If we suppose that there is an approximately linear relationship between x and y, where x takes given values, we can write Y = α + βx + E. This is known as the linear regression model.
- Y: the possible values of the variable Y at a given value of x.
- α + βx: the values of Y obtained from the straight-line relationship.
- E: an error term (a variable - we don't know what values it will take), added to represent the fact that the relationship between x and Y is not exactly linear, i.e. there is some scatter.
- α and β are constants, but are unknown at present.

11 What does regression do?
- Fits the line of best fit y = a + bx
- Ordinary Least Squares (OLS) regression: minimises the squared residuals
In statistical terms it fits a General Linear Model, with residual variation ~ N(0, σ²).
In practice we estimate the relationship using data and specify a corresponding empirical relationship.
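
The slides suggest Excel for the fitting; as an alternative sketch (my own, not from the lecture), NumPy's polyfit performs the same ordinary least squares straight-line fit:

    import numpy as np

    # Illustrative data only (not from the lecture).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    # Degree-1 polyfit is an OLS straight-line fit; it returns the
    # coefficients highest power first, i.e. [slope b, intercept a].
    b, a = np.polyfit(x, y, 1)
    print(f"y = {a:.2f} + {b:.2f}x")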

12 Residuals - the errors between the line of best fit and the data points.

13 What is the "best fit" line? The one that minimises the squared residuals. (Plot: dependent variable (y) against independent or control variable (x).)

14 ©Paul Anderson 2008, sourced from http://commons.wikimedia.org/wiki/File:Traffic_on_Bolton_Road_West_Ramsbottom_-_geograph.org.uk_-_992680.jpg?uselang=en-gb Available under creative commons license
As part of assessing the impact of a new road junction in Leeds, a relationship between car 'idling time' (the time a car is stationary) and the exhaust emission level of a certain pollutant is needed. (We consider the results for just one engine size.)
Idling time (s):    2   2   2   4   6   6  10  10  10  12  14  14  14  14  16  18  18  18  20  20
Emissions (ppm):   10  11  16  22  27  33  52  51  55  62  71  67  73  75  82  88  93  93 100 102
Step 1) Plot a graph of pollutant level (y) against idling time (x).

15 α and β are unknown => they must be estimated from the observations. We need to find the equation of the fitted line (i.e. find a and b). For the linear regression model, this means choosing a and b to make the overall magnitude of all the e_i's as small as possible. Note that the individual errors can be calculated as e_i = y_i - (a + b·x_i). Because e_i may be negative or positive, we minimise the squared residuals - so we choose a and b such that they minimise Σ e_i² = Σ (y_i - a - b·x_i)².
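
The slides jump from this objective to the formulas for a and b on the next slide; a standard derivation sketch (not in the original slides) connects the two - set the partial derivatives of the sum of squared residuals to zero and solve:

    S(a,b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2
    \frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
    \frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0
    \Rightarrow \; a = \bar{y} - b\,\bar{x}, \qquad b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}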

16 How do we calculate the values of a and b (the least squares estimators)? These equations look a bit tricky, but are straightforward to use - they just use the mean values and the data points:
    b = S_xy / S_xx, where S_xy = Σ(x_i - x̄)(y_i - ȳ) and S_xx = Σ(x_i - x̄)²
    a = ȳ - b·x̄
x̄ and ȳ are just the means of the sample data x and y. Again, we often use a package like Excel to compute the values of a and b from these equations.
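
A minimal Python sketch of these two formulas (the function name ols_fit is my own, not from the lecture):

    import numpy as np

    def ols_fit(x, y):
        """Return (a, b) for the least squares line y = a + b*x."""
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        x_bar, y_bar = x.mean(), y.mean()
        s_xx = np.sum((x - x_bar) ** 2)           # S_xx
        s_xy = np.sum((x - x_bar) * (y - y_bar))  # S_xy
        b = s_xy / s_xx          # slope
        a = y_bar - b * x_bar    # intercept
        return a, b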

17 Using the idling-time data from slide 14 (n = 20, x̄ = 11.5, ȳ = 59.15):
    S_xx = 711    S_yy = 17632.55    S_xy = 3529.5

18 With S_xx = 711 and S_xy = 3529.5, b = S_xy / S_xx ≈ 4.96 and a = ȳ - b·x̄ ≈ 2.06, so the fitted line is y ≈ 2.06 + 4.96x. We can now use this equation to predict values of y (pollutant level) for given values of x (idling time). (We should not try to predict the other way around!) e.g. when the idling time is 15 seconds, the predicted pollutant level ≈ 76.5 ppm.
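
As a check (not part of the original slides), running the ols_fit sketch from slide 16 on the slide 14 data reproduces these values:

    idling = [2, 2, 2, 4, 6, 6, 10, 10, 10, 12, 14, 14, 14, 14, 16, 18, 18, 18, 20, 20]
    pollutant = [10, 11, 16, 22, 27, 33, 52, 51, 55, 62, 71, 67, 73, 75, 82, 88, 93, 93, 100, 102]

    a, b = ols_fit(idling, pollutant)
    print(f"a = {a:.2f}, b = {b:.2f}")       # a ≈ 2.06, b ≈ 4.96
    print(f"y(15) = {a + b * 15:.1f} ppm")   # ≈ 76.5 ppm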

19 Example: find the linear regression model for the data below (x̄ and ȳ are just the means of the sample data):
    Curing time (days):     1  1  2  2  3  4  5  10
    Breaking strength (kN): 2  3  2  4  5  4  6   8
b = 0.62903, a = 2.04839, so breaking strength ≈ 2.05 + 0.63 × (curing time).
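
The same ols_fit sketch verifies this worked example (again, just a check rather than part of the lecture):

    curing = [1, 1, 2, 2, 3, 4, 5, 10]
    strength = [2, 3, 2, 4, 5, 4, 6, 8]

    a, b = ols_fit(curing, strength)
    print(f"a = {a:.5f}, b = {b:.5f}")   # a = 2.04839, b = 0.62903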

20 How good is the regression? How well does our data fit the straight line we have produced?

21 How good is the regression? 3 techniques to be aware of – 1) Coefficient of Determination 2) Examine the residuals 3) Significance test

22 (Image-only slide - no text content.)

23 1) Coefficient of Determination (R² value). Indicates how well the regression line approximates the real data points. R² = 1 (100%) indicates the data is a perfect fit to the line (100% of the variation in y is explained by the relationship). R² = 0.5 (50%) indicates that 50% of the total variation in y can be explained by the model (as described by the regression equation); the other 50% of the total variation in y remains unexplained.
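
A short sketch (my own, not from the slides) of how R² can be computed from a fitted line, as one minus the ratio of unexplained to total variation; for simple linear regression this equals r²:

    import numpy as np

    def r_squared(x, y, a, b):
        # Coefficient of determination for the fitted line y = a + b*x.
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        y_hat = a + b * x
        ss_res = np.sum((y - y_hat) ** 2)       # unexplained (residual) variation
        ss_tot = np.sum((y - y.mean()) ** 2)    # total variation in y
        return 1.0 - ss_res / ss_tot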

24 2) Examine the residuals We’ll look at Significance tests on the regression parameters next lecture

25 What should the plot of the residuals look like? Random scatter, with no pattern. (Plot of residuals against the y-data.)
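
A self-contained sketch (illustrative only, not from the lecture) of what a healthy residual plot looks like, using synthetic data with a genuine linear trend plus noise:

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic data: a linear trend plus random noise, for illustration only.
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 10.0, 50)
    y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, size=x.size)

    b, a = np.polyfit(x, y, 1)          # OLS slope and intercept
    residuals = y - (a + b * x)

    plt.scatter(x, residuals)
    plt.axhline(0.0, linestyle="--")    # reference line at zero
    plt.xlabel("x")
    plt.ylabel("residual")
    plt.title("Adequate linear fit: residuals scatter randomly about zero")
    plt.show()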

26 What should the plot of the residuals look like? A pattern that is NOT random could indicate that a linear model was not appropriate - perhaps one or both of the variables need to be transformed (next lecture).

27 Non-constant variance in the residuals (non-constant variance in the error terms) undermines one of the assumptions of least squares: that the residual variation is ~N(0, σ²).

28 If we see patterns in the data, it may be because a non-linear regression (e.g. a log transformation, a quadratic, or another polynomial) would have been a better choice. We will look at ways to do this next lecture. (Slides 29 and 30 repeat this point for further example patterns.)

31 Plot the residuals against the explanatory variables in the model. The residuals should have no relation to these variables (look for possible non-linear relations), and the spread of the residuals should be the same over the whole range. Even in this case, where the residuals look random, it can be worth trying to transform either X or Y to see if the regression can be improved. (Excel can produce these residual plots.)

32 Regression - what do you need to be able to do?
- Take some data (select which is the independent and which the dependent variable)
- Visualise the data - plot a scatter graph
- Calculate S_xx and S_xy (from the equations) so that the regression line can be found (you can use Excel to make life easier)
- Calculate the regression line y = a + bx (find a and b)
- Check how good the regression is (3 techniques)

33 Coursework
- Available today. Submission date: 17th March. 12 sides maximum.
- You need to submit online (VLE) and a hardcopy by 4pm (late penalties apply until both are submitted).
- Major coursework rules apply. Please take care not to plagiarise - it will be taken very seriously. Group work on this major coursework would constitute plagiarism (plagiarism software is now used as normal practice on all submissions). Two people got zero last year for being caught plagiarising.
- For question 2 of the coursework you only compare two sets of data - check the group sheet to know which two sets.
- Examples classes: please don't ask direct questions about the coursework, as tutors will only help with general and examples-class material.

34 http://www.youtube.com/watch?v=ocGEhiLwDVc

35 The plot below shows residuals from a simple regression. What, if anything, is of greatest concern about these residuals? (A) They exhibit heteroscedasticity. (B) They exhibit homoscedasticity. (C) They show serial correlation. (D) They are not distributed N(0, σ²). (E) They are perfectly normal. 0031v01

36 What is the greatest concern about the regression below? (A) It has a small slope. (B) It has a high R². (C) The investigator should not be using a linear regression on these data. (D) The residuals are too large. (E) The regression line does not pass through the origin. 0032v01

37 The scatter plot below shows Norman temperatures in degrees C and F. Should regression be used on these data? Why or why not? (A) Yes, they yield a high R², therefore they are ideal for regression. (B) Yes, they exhibit a strong linear relationship. (C) No, they represent a functional relationship, not a statistical one. (D) No, the variance envelope is heteroscedastic. (E) No, the variance envelope is homoscedastic. 0033v01

38 What is the most common rationale for significance testing of regression? (A) to test if the intercept is significantly large (B) to test if the slope of the regression line is positive (C) to test if the slope of the regression line is negative (D) to test if the slope is different from zero (E) to appease an editor or reviewer when publishing the results 0034v01

39 Why is it important to look for outliers in data prior to applying regression? (A) Outliers always affect the magnitude of the regression slope. (B) Outliers are always bad data. (C) Outliers should always be eliminated from the data set. (D) Outliers should always be considered because of their potential influence. (E) We shouldn’t look for outliers, because all the data must be analyzed. 0035v01

40 Joe found a strong correlation in his study showing that individual’s physical ability decreased significantly with age. Which numerical result below best describes this situation? (A) -1.2 (B) -1.0 (C) -0.8 (D) +0.8 (E) +1.0 (F) +1.2 0084v01

41 A researcher found that r = +.92 between the high temperature of the day and the number of ice cream cones sold at Cone Island. This result tells us that (A) high temperatures cause people to buy ice cream. (B) buying ice cream causes the temperature to go up. (C) some extraneous variable causes both high temperatures and high ice cream sales. (D) temperature and ice cream sales have a strong positive linear relationship. 0085v01

42 In a study of caffeine's impact on creative problem-solving, researchers found an r = +0.20 correlation between levels of caffeine consumption and total number of creative solutions generated. This result suggests that (A) there is a weak-to-moderate relationship between levels of caffeine consumption and total number of creative solutions generated. (B) there is no statistically significant correlation between levels of caffeine consumption and total number of creative solutions generated. (C) there is possibly a non-linear relationship between levels of caffeine consumption and total number of creative solutions generated. (D) At least two of (A)-(C) may happen simultaneously. (E) All of (A)-(C) may happen simultaneously. 0086v01

43 You are conducting a correlation analysis between a response variable and an explanatory variable. Your analysis produces a significant positive correlation between the two variables. Which of the following conclusions is the most reasonable? (A) Change in the explanatory variable causes change in the response variable (B) Change in the explanatory variable results in change in the response variable (C) Change in the response variable causes change in the explanatory variable (D) Change in the response variable results in change in the explanatory variable (E) All from (A)-(D) are equally reasonable conclusions 0087v01

44 Which phrase best describes the scatterplot? (A) strong +r (B) strong -r (C) weak +r (D) weak -r (E) influential outliers (F) non-linearity (G) measurement problem (H) Two from (A)-(G) are true. (I) Three from (A)-(G) are true. 0088v01

45 Which correlation best describes the scatterplot? (A) -0.7 (B) -0.3 (C) 0 (D) +0.3 (E) +0.7 0089v01

46 Which of the following factors is NOT important to consider when interpreting a correlation coefficient? (A) restriction of range (B) problems associated with aggregated data (C) outliers (D) lurking variables (E) unit of measurement 0090v01

47 If you strongly believed in the idea that the more hours per week full-time students worked, the lower their GPA would be, then which correlation would you realistically expect to find? (A) -0.97 (B) -0.68 (C) -0.20 (D) +0.20 (E) +0.68 (F) +0.97 0091v01

