# S3: Chapter 5 – Regression and Correlation

Dr J Frost Last modified: 13th February 2015

What this chapter is mostly about
One question that naturally arises amongst teachers at Tiffin is whether the 11+ tests are an effective predictor of academic success later on. Tiffin has recently dropped its Non-Verbal/Verbal 11+ tests in favour of English/Maths tests. It will be interesting to see to what extent the correlation between 11+ scores and later metrics (e.g. average A2 point scores) increases. We could just calculate the PMCC of the two variables, but we might just be interested in comparing the rankings.

| 11+ NVR Score | Avg AS point score |
|---|---|
| 119 | 287 |
| 103 | 265 |
| 110 | 137 |
| 37 | 300 |

Disclaimer: These values are made up!

Tiffin Data Fun Facts

Fun True Fact: The PMCC between the 11+ ranks of students in the current L6 and their last JMC score ranks is ?

Fun True Fact: The PMCC between the NVR ranks and C1 test ranks (when taken in Year 11) is ? (For Year 11s in 2010.)

Fun True Fact: The PMCC between Year 7 end-of-year test rank and Year 9 test rank is ? (For Year 11s in 2010.)

Fun True Fact: The PMCC between Year 8 end-of-year test rank and Year 9 test rank is ? (For Year 11s in 2010.)

Fun True Fact: The PMCC between NVR and Year 9 test rank is 0.12, and between VR and Year 9 test rank is 0.16. (For Year 9s in 2009.)

NVR% and VR% (Year 7s in 2007): All Tiffinians 84%; Oxbridge Tiffs 89%, 85%.

RECAP: Product Moment Correlation Coefficient
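| 11+ NVR Score ($x$) | Avg AS point score ($y$) |
|---|---|
| 119 | 287 |
| 103 | 265 |
| 110 | 137 |
| 37 | 300 |

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}, \qquad S_{xx} = \Sigma x^2 - \frac{(\Sigma x)^2}{n}, \qquad S_{xy} = \Sigma xy - \frac{\Sigma x \, \Sigma y}{n}$$

With $n = 4$:

$\Sigma x = 369$, $\Sigma y = 989$, $\Sigma x^2 = 38239$, $\Sigma y^2 = 261363$, $\Sigma xy = 87618$

$S_{xx} = 4198.75$, $S_{yy} = 16832.75$, $S_{xy} = -3617.25$

$r = -0.43$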
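The recap calculation can be checked with a few lines of Python. This is a minimal sketch rather than anything from the slides: the function name `pmcc` is made up for illustration, and the data is the made-up table above.

```python
from math import sqrt

# Made-up data from the slide: 11+ NVR score (x) and average AS point score (y).
x = [119, 103, 110, 37]
y = [287, 265, 137, 300]

def pmcc(xs, ys):
    """Product moment correlation coefficient via S_xx, S_yy and S_xy."""
    n = len(xs)
    s_xx = sum(v * v for v in xs) - sum(xs) ** 2 / n
    s_yy = sum(v * v for v in ys) - sum(ys) ** 2 / n
    s_xy = sum(a * b for a, b in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xy / sqrt(s_xx * s_yy)

r = pmcc(x, y)
print(round(r, 2))  # -0.43
```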

Spearman’s rank correlation coefficient
However, if we’re simply interested in how the rankings are correlated, we might discard the original data and use the rankings instead.

| 11+ NVR Score ($x$) | Avg AS point score ($y$) | 11+ rank | AS rank |
|---|---|---|---|
| 119 | 287 | 1 | 2 |
| 103 | 265 | 3 | 3 |
| 110 | 137 | 2 | 4 |
| 37 | 300 | 4 | 1 |

Using your STATS mode on the ranks: $r = -0.4$

! Spearman’s rank correlation coefficient $r_s$ is the PMCC calculated after the data has been converted to rankings.
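As a rough illustration of "rank first, then take the PMCC", the sketch below ranks the made-up scores (rank 1 for the highest score, assuming no ties) and reuses the PMCC formula. The helper names `to_ranks` and `pmcc` are hypothetical.

```python
from math import sqrt

def to_ranks(values):
    """Rank 1 for the largest value; this simple version assumes no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def pmcc(xs, ys):
    """Product moment correlation coefficient via S_xx, S_yy and S_xy."""
    n = len(xs)
    s_xx = sum(v * v for v in xs) - sum(xs) ** 2 / n
    s_yy = sum(v * v for v in ys) - sum(ys) ** 2 / n
    s_xy = sum(a * b for a, b in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xy / sqrt(s_xx * s_yy)

x = [119, 103, 110, 37]   # made-up 11+ NVR scores
y = [287, 265, 137, 300]  # made-up average AS point scores

rank_x = to_ranks(x)  # [1, 3, 2, 4]
rank_y = to_ranks(y)  # [2, 3, 4, 1]
r_s = pmcc(rank_x, rank_y)
print(round(r_s, 2))  # -0.4
```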

Interpreting $r_s$

$r_s = 1$: Rankings in perfect agreement.

$r_s = -1$: Ranks in reverse order.

$r_s = 0$: No correlation in rankings.

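Calculating $r_s$ more easily

! If no tied ranks:

$$r_s = 1 - \frac{6 \Sigma d^2}{n(n^2 - 1)}$$

where $d$ is the difference between the two ranks of each data pair. (If there are tied ranks, calculate the normal PMCC on the ranked data.)

| 11+ rank ($x$) | AS rank ($y$) | $d$ |
|---|---|---|
| 1 | 2 | -1 |
| 3 | 3 | 0 |
| 2 | 4 | -2 |
| 4 | 1 | 3 |

$\Sigma d^2 = 1 + 0 + 4 + 9 = 14$

$$r_s = 1 - \frac{6 \times 14}{4(4^2 - 1)} = -0.4$$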
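The shortcut formula is easy to sketch in code. The function name `spearman_shortcut` is made up for illustration, and the ranks are those of the four made-up students, assuming rank 1 goes to the highest score and there are no ties.

```python
def spearman_shortcut(rank_x, rank_y):
    """r_s = 1 - 6*sum(d^2) / (n(n^2 - 1)); only valid when there are no tied ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

rank_x = [1, 3, 2, 4]  # 11+ ranks
rank_y = [2, 3, 4, 1]  # AS ranks

sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
r_s = spearman_shortcut(rank_x, rank_y)
print(sum_d2, round(r_s, 2))  # 14 -0.4
```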

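Proof of $r_s$ and PMCC equivalence

(Not in textbook/exam)

$$S_{xx} = \Sigma x^2 - \frac{(\Sigma x)^2}{n} = \Sigma x^2 - n\bar{x}^2, \qquad S_{xy} = \Sigma (x - \bar{x})(y - \bar{y})$$

Since we know the $x$ are the ranks 1 to $n$:

$$\bar{x} = \tfrac{1}{2}(n+1), \qquad \Sigma x^2 = \tfrac{1}{6} n(n+1)(2n+1)$$

Therefore:

$$S_{xx} = \tfrac{1}{6} n(n+1)(2n+1) - n\left(\frac{n+1}{2}\right)^2 = \frac{n(n^2-1)}{12}$$

The $y$ are also the ranks 1 to $n$, so $\bar{y} = \bar{x}$, $\Sigma y^2 = \Sigma x^2$ and $S_{yy} = S_{xx}$.

$$S_{xy} = \Sigma (x - \bar{x})(y - \bar{y}) = \Sigma xy - \bar{y}\,\Sigma x - \bar{x}\,\Sigma y + n\bar{x}\bar{y} = \Sigma xy - n\left(\frac{n+1}{2}\right)^2$$

$$= \frac{n(n+1)(n-1)}{12} - \frac{\Sigma x^2 + \Sigma y^2}{2} + \Sigma xy = \frac{n(n^2-1)}{12} - \frac{\Sigma (x^2 - 2xy + y^2)}{2}$$

$$= \frac{n(n^2-1)}{12} - \frac{\Sigma (x - y)^2}{2} = \frac{n(n^2-1)}{12} - \frac{\Sigma d^2}{2}$$

$$r_s = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\dfrac{n(n^2-1)}{12} - \dfrac{\Sigma d^2}{2}}{\dfrac{n(n^2-1)}{12}} = 1 - \frac{6 \Sigma d^2}{n(n^2-1)}$$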
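The equivalence can also be sanity-checked numerically. The sketch below is an illustration, not part of the slides: it uses exact fractions and tries every possible ranking for small $n$, confirming that the PMCC on ranks matches $1 - \frac{6\Sigma d^2}{n(n^2-1)}$. The names `s_xy` and `check_identity` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def s_xy(xs, ys):
    """S_xy = sum(xy) - sum(x)sum(y)/n, computed exactly."""
    n = len(xs)
    return sum(a * b for a, b in zip(xs, ys)) - Fraction(sum(xs) * sum(ys), n)

def check_identity(n):
    """Check the steps of the proof for every permutation of the ranks 1..n."""
    x = tuple(range(1, n + 1))
    s_xx = Fraction(n * (n * n - 1), 12)
    assert s_xy(x, x) == s_xx  # S_xx = n(n^2 - 1)/12
    for y in permutations(x):
        sum_d2 = sum((a - b) ** 2 for a, b in zip(x, y))
        # S_xy = n(n^2 - 1)/12 - sum(d^2)/2
        assert s_xy(x, y) == s_xx - Fraction(sum_d2, 2)
        # S_yy = S_xx, so r_s = S_xy / S_xx
        assert s_xy(x, y) / s_xx == 1 - Fraction(6 * sum_d2, n * (n * n - 1))
    return True

all_ok = all(check_identity(n) for n in range(2, 7))
print(all_ok)  # True
```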

Edexcel S3 June 2011 Q2

Bro Exam Tip: Use your calculator STATS mode to calculate $r$ on your ranked data and check your answer – guaranteed full marks every time!

Differences between $r$ and $r_s$

(Bro Exam Tip: This can be tested!)

Original data: $r < 1$. Ranked data: $r_s = 1$.

Spearman’s Rank: Makes no assumption about the original data: the original data need not be linear.

PMCC: We can only do a hypothesis test if the variables are (jointly) normally distributed. (We’ll do hypothesis testing in a sec.)

Exercise 5A

Hypothesis Testing

What do you think would be a suitable null hypothesis when analysing the correlation of two variables? The null hypothesis in general is that any pattern in the data arose by chance, i.e. in this case, that there is no linear correlation between the variables.

Now suppose the two variables were each normally distributed.

[Figure: distributions $f(x)$ of English mark $x$ and $f(y)$ of Maths mark $y$]

See Demo (File Ref: PMCC_Correlation_Model)

Questions from Demo

Given the points were randomly generated, what do we expect the correlation to be? 0: if the data was randomly generated and the variables were independent, there’s no inherent connection between them.

Is it possible that for some randomly generated independent data, the correlation may be high? Yes, just by chance they could show either positive or negative correlation.

! $\rho$ (Greek letter “rho”) is a population parameter which is the actual correlation between variables $X$ and $Y$.

! $r$ is the observed correlation from a sample. This varies across samples.

We saw in the demo that when $\rho = 0$, $r$ jumped around symmetrically about $\rho$. This forms an (incredibly complicated) sampling distribution for $r$.

[Figure: sampling distribution $f(r)$ for $r$ from $-1$ to $1$, with the upper 5% tail marked]

We might be interested in knowing the critical value $c$ such that the probability of $r$ exceeding it is 5%, i.e. the point beyond which any correlation seen is considered significant (under the assumption that any correlation there is, is by chance).
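The demo itself is not reproduced here, but its idea can be sketched as a simulation: repeatedly draw two independent normal samples of size 10 (so $\rho = 0$), compute $r$ each time, and read off the value that only 5% of the simulated $r$ exceed. Everything below (function names, 20,000 trials, the fixed seed) is an assumption made for illustration. The estimate should land close to the formula booklet value of 0.5494 for $n = 10$.

```python
import random
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient via S_xx, S_yy and S_xy."""
    n = len(xs)
    s_xx = sum(v * v for v in xs) - sum(xs) ** 2 / n
    s_yy = sum(v * v for v in ys) - sum(ys) ** 2 / n
    s_xy = sum(a * b for a, b in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xy / sqrt(s_xx * s_yy)

def simulate_r(n, trials, seed=0):
    """Sorted list of r values from independent standard normal samples (rho = 0)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = []
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        results.append(pmcc(xs, ys))
    return sorted(results)

rs = simulate_r(n=10, trials=20000)
mean_r = sum(rs) / len(rs)
critical = rs[int(0.95 * len(rs))]  # about 5% of simulated r values lie above this
print(round(mean_r, 3), round(critical, 3))
```

The mean of the simulated $r$ values sits near 0, which matches the observation that $r$ jumps around symmetrically about $\rho$.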

Correlation Coefficient Table
Formula booklet (note $\rho = 0$, i.e. we’re always assuming no correlation in S3).

In our demo our sample size was $n = 10$ and $\rho = 0$. Determine:

The critical region at which we have a significant positive correlation (significance level 5%): $r > 0.5494$

The critical region at which we have a significant correlation (significance level 5%): $r < -0.6319$ or $r > 0.6319$

The critical region at which we have a significant correlation (significance level 1%): $r < -0.7646$ or $r > 0.7646$

Example Hypothesis Test

The product-moment correlation coefficient between 30 pairs of reactions is $r = -0.45$. Using a 0.05 significance level, test whether or not $\rho$ differs from 0.

Null/Alternative Hypotheses? $H_0: \rho = 0$, $H_1: \rho \neq 0$

Critical Region? $r < -0.3610$ or $r > 0.3610$

Conclusion? $-0.45 < -0.3610$, therefore the value of $r$ is significant. Reject $H_0$ and accept $H_1$. There is evidence of some correlation.

The table shows the BMI (Body Mass Index) of a number of people along with their age. What assumption are we making about the data in order to carry out a hypothesis test on the Product Moment Correlation Coefficient? Carry out a suitable hypothesis test at the 5% level that age and BMI are correlated.

| Age | 26 | 30 | 31 | 50 | 42 |
|---|---|---|---|---|---|
| BMI | 18 | 21 | 20.5 | 24 | 17 |

Assumption: That age and BMI are normally distributed.

$H_0: \rho = 0$, $H_1: \rho \neq 0$

$r = 0.455$

$0.455 < 0.8783$, therefore do not reject $H_0$. There is insufficient evidence for correlation between age and BMI.
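The BMI example can be reproduced as a short sketch: compute $r$ from the table, then compare $|r|$ with the two-tailed 5% critical value for $n = 5$ quoted above (0.8783). The names `pmcc`, `critical_value` and `reject_h0` are illustrative only.

```python
from math import sqrt

age = [26, 30, 31, 50, 42]
bmi = [18, 21, 20.5, 24, 17]

def pmcc(xs, ys):
    """Product moment correlation coefficient via S_xx, S_yy and S_xy."""
    n = len(xs)
    s_xx = sum(v * v for v in xs) - sum(xs) ** 2 / n
    s_yy = sum(v * v for v in ys) - sum(ys) ** 2 / n
    s_xy = sum(a * b for a, b in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xy / sqrt(s_xx * s_yy)

r = pmcc(age, bmi)

# Two-tailed test at the 5% level: H0: rho = 0, H1: rho != 0.
critical_value = 0.8783  # table value for n = 5, as quoted in the example
reject_h0 = abs(r) > critical_value

print(round(r, 3), reject_h0)  # 0.455 False
```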

Hypothesis Testing with Spearman’s Rank
Why do you think we might use a different table for hypothesis testing with Spearman’s rank?

For the PMCC, the distribution $p(r)$ was produced by repeatedly sampling from two (jointly) normally distributed variables and taking the PMCC each time, i.e. the variables are assumed to be normally distributed. But with data for which we’re calculating $r_s$, the data in each variable is always 1 to $n$!

Sampling distribution of $r_s$ when the sample size is, say, 2 or 3:

$n = 2$: Suppose we ordered the first set of data $x$ so that the ranks are 1, 2.

| Possible samples for $Y$ | $r_s$ |
|---|---|
| (1,2) | 1 |
| (2,1) | −1 |

| $r_s$ | −1 | 1 |
|---|---|---|
| $p(r_s)$ | 1/2 | 1/2 |

$n = 3$: Suppose we ordered the first set of data $x$ so that the ranks are 1, 2, 3.

| Possible samples for $Y$ | $r_s$ |
|---|---|
| (1,2,3) | 1 |
| (1,3,2) | 0.5 |
| (2,1,3) | 0.5 |
| (2,3,1) | −0.5 |
| (3,1,2) | −0.5 |
| (3,2,1) | −1 |

| $r_s$ | −1 | −0.5 | 0.5 | 1 |
|---|---|---|---|---|
| $p(r_s)$ | 1/6 | 2/6 | 2/6 | 1/6 |
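Because the ranks are always 1 to $n$, the sampling distribution of $r_s$ under "every ordering equally likely" can be listed exactly. The sketch below does this for $n = 2$ and $n = 3$; the function names are made up, and it uses the no-ties shortcut formula with exact fractions.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def spearman(rank_x, rank_y):
    """Exact r_s from the no-ties shortcut 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - Fraction(6 * sum_d2, n * (n * n - 1))

def sampling_distribution(n):
    """Map each possible r_s to its probability when all n! rankings of Y are equally likely."""
    x = tuple(range(1, n + 1))  # x ordered so its ranks are 1, 2, ..., n
    counts = Counter(spearman(x, y) for y in permutations(x))
    total = sum(counts.values())
    return {r_s: Fraction(c, total) for r_s, c in sorted(counts.items())}

dist2 = sampling_distribution(2)
dist3 = sampling_distribution(3)

for r_s, p in dist3.items():
    print(r_s, p)
```

Note that `Fraction` prints 2/6 in lowest terms, as 1/3.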
