Presentation on theme: "S3: Chapter 5 – Regression and Correlation"— Presentation transcript:
1 S3: Chapter 5 – Regression and Correlation. Dr J Frost. Last modified: 13th February 2015
2 What this chapter is mostly about
One question that naturally arises amongst teachers at Tiffin is whether the 11+ tests are an effective predictor of academic success later on. Tiffin has recently dropped its Non-Verbal/Verbal 11+ tests in favour of English/Maths tests. It will be interesting to see to what extent the correlation between 11+ scores and later metrics (e.g. average A2 point scores) increases.
We could just calculate the PMCC of the two variables, but we might just be interested in comparing the rankings.
Table columns: 11+ NVR Score, Avg AS point score. Values: 11910311037287265137300
* Disclaimer: These values are made up!
3 Tiffin Data Fun Facts
Fun True Fact: The PMCC between the 11+ ranks of students in the current L6 and their last JMC score ranks is ?
Fun True Fact: The PMCC between the NVR ranks and C1 test ranks (when taken in Year 11) is ? (For Year 11s in 2010.)
Fun True Fact: The PMCC between Year 7 end-of-year test rank and Year 9 test rank is ? (For Year 11s in 2010.)
Fun True Fact: The PMCC between Year 8 end-of-year test rank and Year 9 test rank is ? (For Year 11s in 2010.)
Fun True Fact: The PMCC between NVR and Year 9 test rank is 0.12, and between VR and Year 9 test rank is 0.16. (For Year 9s in 2009.)
NVR% and VR% (Year 7s in 2007): All Tiffinians: 84%. Oxbridge Tiffs: 89%, 85%.
5 Spearman’s rank correlation coefficient
However, if we’re simply interested in how the rankings are correlated, we might discard the original data and use the rankings instead.
Table: 11+ NVR Score ($x$) and Avg AS point score ($y$) for the same four students, with each score replaced by its rank from 1 to 4.
Using your STATS mode on the ranks: $r = -0.4$
! Spearman’s rank correlation coefficient $r_s$ is the PMCC calculated after the data has been converted to rankings.
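As a minimal sketch of this definition in Python: rank each variable, then take the PMCC of the ranks. The scores below are hypothetical (they are not the table's values) and were chosen so that the result agrees with the $r = -0.4$ quoted on the slide. The helper names `rank_descending` and `pmcc` are illustrative, and tied values are not handled.

```python
from math import sqrt

def rank_descending(values):
    """Give rank 1 to the largest value. Assumes there are no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def pmcc(xs, ys):
    """Product moment correlation coefficient, r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for four students (assumption, for illustration only).
nvr_scores = [112, 98, 105, 131]
as_points = [240, 310, 262, 285]

x_ranks = rank_descending(nvr_scores)   # [2, 4, 3, 1]
y_ranks = rank_descending(as_points)    # [4, 1, 3, 2]

# Spearman's rank coefficient is the PMCC of the ranks.
r_s = pmcc(x_ranks, y_ranks)
print(round(r_s, 4))
```

Ranking in ascending order instead would give the same coefficient, as long as both variables are ranked in the same direction.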
6 Interpreting $r_s$
$r_s = 1$: Rankings in perfect agreement.
$r_s = -1$: Ranks in reverse order.
$r_s = 0$: No correlation in rankings.
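7 Calculating $r_s$ more easily
! If no tied ranks: $r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the two ranks of the $i$th pair.
(If tied ranks, calculate normal PMCC on ranked data.)
Table columns: 11+ rank ($x$), AS rank ($y$), $d$. Surviving entries: 1, 2, −1, 3, 4, −2.
$\sum d^2 = 14$
$r_s = 1 - \frac{6 \times 14}{4(4^2 - 1)} = -0.4$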
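The shortcut formula can be sketched in Python as follows. The ranks used here are an assumption: they are one arrangement of four pairs consistent with $\sum d^2 = 14$, since the slide's own table cannot be fully recovered. The function name `spearman_no_ties` is illustrative.

```python
def spearman_no_ties(x_ranks, y_ranks):
    """r_s = 1 - 6*sum(d^2) / (n(n^2 - 1)); valid only when there are no tied ranks."""
    n = len(x_ranks)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical ranks (assumption): differences d are -1, -2, 0, 3, so sum(d^2) = 14.
x_ranks = [1, 2, 3, 4]
y_ranks = [2, 4, 3, 1]

r_s = spearman_no_ties(x_ranks, y_ranks)
print(round(r_s, 4))
```

With $n = 4$ the denominator is $4 \times 15 = 60$, so the result should agree with the slide's value of $-0.4$ up to floating-point rounding.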
9 Test Your Understanding
Edexcel S3 June 2011 Q2
Bro Exam Tip: Use your calculator STATS mode to calculate $r$ on your ranked data and check your answer – guaranteed full marks every time!
10 Differences between $r$ and $r_s$ (Bro Exam Tip: This can be tested!)
[Figure: two scatter plots. Original data: $r < 1$. Ranked data: $r_s = 1$.]
Spearman’s Rank: Makes no assumption about original data: original data need not be linear.
PMCC: We can only do a hypothesis test if the variables are (jointly) normally distributed. (We’ll do hypothesis testing in a sec.)
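A small Python illustration of the first point, using made-up data where $y$ always increases with $x$ but not along a straight line. The data ($y = x^3$) and helper names are assumptions for the sketch; the PMCC of the raw values falls short of 1, while the PMCC of the ranks is exactly 1.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient, r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def rank_ascending(values):
    """Give rank 1 to the smallest value. Assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

# Made-up data: increasing, but curved rather than linear.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

r = pmcc(x, y)                                     # PMCC on the original data
r_s = pmcc(rank_ascending(x), rank_ascending(y))   # PMCC on the ranks

print(round(r, 3), round(r_s, 3))
```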
12 Hypothesis Testing
What do you think would be a suitable null hypothesis when analysing the correlation of two variables?
In general, the null hypothesis is that the data is random; in this case, that there is no linear correlation between the two variables.
Now suppose the two variables were each normally distributed.
[Figure: distributions $f(x)$ of English mark $x$ and $f(y)$ of Maths mark $y$.]
See Demo > (File Ref: PMCC_Correlation_Model)
13 Questions from Demo
Given the points were randomly generated, what do we expect the correlation to be?
0: if the data was randomly generated and the variables were independent, there’s no inherent connection between them.
Is it possible that for some randomly generated independent data, the correlation may be high?
Yes, just by chance they could show either positive or negative correlation.
! $\rho$ (Greek letter “rho”) is a population parameter which is the actual correlation between variables $X$ and $Y$.
! $r$ is the observed correlation from a sample. This varies across samples.
We saw in the demo that when $\rho = 0$, $r$ jumped around symmetrically about $\rho$. This forms an (incredibly complicated) sampling distribution for $r$.
[Figure: sampling distribution $f(r)$ for $r$ between −1 and 1, with the upper 5% tail shaded.]
We might be interested in the critical value $c$ for which $P(r > c) = 5\%$ when any correlation seen is purely due to chance, i.e. the point beyond which an observed correlation is considered significant.
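The demo file itself is not reproduced here, but a rough Python simulation of the same idea is sketched below, under the assumption that both variables are independent standard normal (so $\rho = 0$). It draws many samples of size $n = 10$, records $r$ each time, and reads off the value that only 5% of the simulated $r$ exceed. The seed, trial count and function names are arbitrary choices for the sketch.

```python
import random
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient, r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def simulate_r(n, trials, seed=0):
    """Return sorted sample correlations from independent normal data (rho = 0)."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        results.append(pmcc(xs, ys))
    return sorted(results)

rs = simulate_r(n=10, trials=20000)
mean_r = sum(rs) / len(rs)           # should be close to 0
critical = rs[int(0.95 * len(rs))]   # estimate of c with P(r > c) = 5%

print(round(mean_r, 3), round(critical, 3))
```

The estimate will vary a little with the seed and the number of trials, but it should land near the tabulated one-tailed 5% value for a sample of size 10.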
14 Correlation Coefficient Table
Formula booklet (note $\rho = 0$, i.e. we’re always assuming no correlation in S3).
In our demo our sample size was $n = 10$ and $\rho = 0$. Determine:
The critical region at which we have a significant positive correlation (significance level 5%): $r > 0.5494$
The critical region at which we have a significant correlation (significance level 5%): $r < -0.6319$ or $r > 0.6319$
The critical region at which we have a significant correlation (significance level 1%): $r < -0.7646$ or $r > 0.7646$
15 Example Hypothesis Test
The product-moment correlation coefficient between 30 pairs of reactions is $r = -0.45$. Using a 0.05 significance level, test whether or not $\rho$ differs from 0.
Null/Alternative Hypotheses: $H_0: \rho = 0$, $H_1: \rho \neq 0$
Critical Region: $r < -0.3610$ or $r > 0.3610$
Conclusion: $-0.45 < -0.3610$, therefore the value of $r$ is significant. Reject $H_0$ and accept $H_1$. There is evidence of some correlation.
16 Test Your Understanding
The table shows the BMI (Body Mass Index) of a number of people along with their age.
Age: 26, 30, 31, 50, 42
BMI: 18, 21, 20.5, 24, 17
What assumption are we making about the data in order to carry out a hypothesis test on the Product Moment Correlation Coefficient?
That age and BMI are normally distributed.
Carry out a suitable hypothesis test at the 5% level to determine whether age and BMI are correlated.
$H_0: \rho = 0$, $H_1: \rho \neq 0$. $r = 0.455$. $0.455 < 0.8783$, therefore do not reject $H_0$. There is insufficient evidence for correlation between age and BMI.
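The calculation in this answer can be checked with a short Python sketch. The data and the two-tailed 5% critical value 0.8783 for $n = 5$ come from the slide; the function and variable names are illustrative.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient, r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

ages = [26, 30, 31, 50, 42]
bmis = [18, 21, 20.5, 24, 17]

r = pmcc(ages, bmis)

# Two-tailed test at the 5% level: reject H0 only if |r| exceeds the table value.
critical_value = 0.8783
reject_h0 = abs(r) > critical_value

print(round(r, 3), reject_h0)  # 0.455 False
```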
17 Hypothesis Testing with Spearman’s Rank
Why do you think we might use a different table for hypothesis testing with Spearman’s rank?
For the PMCC, the sampling distribution of $r$ was produced by repeatedly sampling from two (jointly) normally distributed variables and taking the PMCC each time, i.e. the variables are assumed to be normally distributed. But when we’re calculating $r_s$, the values of each variable are always the ranks 1 to $n$.
Sampling distribution of $r_s$ when the sample size is, say, 2 or 3:
$n = 2$: Suppose we ordered the first set of data $x$ so that its ranks are 1, 2.
Possible samples for $Y$ and $r_s$: (1,2): 1; (2,1): −1
Distribution: $p(r_s = -1) = \frac{1}{2}$, $p(r_s = 1) = \frac{1}{2}$
$n = 3$: Suppose we ordered the first set of data $x$ so that its ranks are 1, 2, 3.
Possible samples for $Y$ and $r_s$: (1,2,3): 1; (1,3,2): 0.5; (2,1,3): 0.5; (2,3,1): −0.5; (3,1,2): −0.5; (3,2,1): −1
Distribution: $p(r_s = -1) = \frac{1}{6}$, $p(r_s = -0.5) = \frac{2}{6}$, $p(r_s = 0.5) = \frac{2}{6}$, $p(r_s = 1) = \frac{1}{6}$
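The enumeration for small $n$ can be sketched in Python by listing every possible ordering of the $y$ ranks, treating each as equally likely, and tallying the resulting $r_s$ values. Exact fractions are used so the probabilities come out as on the slide (with $\frac{2}{6}$ shown in lowest terms as $\frac{1}{3}$). The function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def spearman_no_ties(x_ranks, y_ranks):
    """Exact r_s = 1 - 6*sum(d^2) / (n(n^2 - 1)) for untied ranks."""
    n = len(x_ranks)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - Fraction(6 * sum_d2, n * (n ** 2 - 1))

def rs_distribution(n):
    """Sampling distribution of r_s when all n! orderings of y are equally likely."""
    x_ranks = tuple(range(1, n + 1))
    counts = Counter(spearman_no_ties(x_ranks, y) for y in permutations(x_ranks))
    total = sum(counts.values())
    return {rs: Fraction(c, total) for rs, c in sorted(counts.items())}

for n in (2, 3):
    print(n, {str(rs): str(p) for rs, p in rs_distribution(n).items()})
```

The same function works for larger $n$, although the number of orderings grows as $n!$, so full enumeration is only practical for small samples.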