S3: Chapter 5 – Regression and Correlation


S3: Chapter 5 – Regression and Correlation Dr J Frost (jfrost@tiffin.kingston.sch.uk) Last modified: 13th February 2015

What this chapter is mostly about
One question that naturally arises amongst teachers at Tiffin is whether the 11+ tests are an effective predictor of academic success later on. Tiffin has recently dropped its Non-Verbal/Verbal 11+ tests in favour of English/Maths tests. It will be interesting to see to what extent the correlation between 11+ scores and later metrics (e.g. average A2 point scores) increases. We could just calculate the PMCC of the two variables, but we might only be interested in comparing the rankings.

11+ NVR Score    Avg AS point score
119              287
103              265
110              137
37               300

* Disclaimer: These values are made up!

Tiffin Data Fun Facts
Fun True Fact: The PMCC between the 11+ ranks of students in the current L6 and their last JMC score ranks is 0.280.
Fun True Fact: The PMCC between the NVR ranks and C1 test ranks (when taken in Year 11) is 0.101. (For Year 11s in 2010.)
Fun True Fact: The PMCC between Year 7 end-of-year test rank and Year 9 test rank is 0.502. (For Year 11s in 2010.)
Fun True Fact: The PMCC between Year 8 end-of-year test rank and Year 9 test rank is 0.794. (For Year 11s in 2010.)
Fun True Fact: The PMCC between NVR and Year 9 test rank is 0.12; between VR and Year 9 test rank it is 0.16. (For Year 9s in 2009.)
NVR% / VR%: All Tiffinians 84%; Oxbridge Tiffs 89% / 85%. (Year 7s in 2007.)

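RECAP: Product Moment Correlation Coefficient

11+ NVR Score ($x$)    Avg AS point score ($y$)
119                    287
103                    265
110                    137
37                     300

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}, \qquad S_{xx} = \Sigma x^2 - \frac{(\Sigma x)^2}{n}, \qquad S_{xy} = \Sigma xy - \frac{\Sigma x \, \Sigma y}{n}$$

$\Sigma x = 369$, $\Sigma y = 989$, $\Sigma x^2 = 38239$, $\Sigma y^2 = 261363$, $\Sigma xy = 87618$
$S_{xx} = 4198.75$, $S_{yy} = 16832.75$, $S_{xy} = -3617.25$
$r = -0.43$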
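The recap figures can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `pmcc` and the variable names are illustrative choices rather than anything from the slides, and the data are the made-up values from the table.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient via S_xx, S_yy and S_xy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)

# Made-up values from the slide: 11+ NVR score and average AS point score.
nvr = [119, 103, 110, 37]
avg_as = [287, 265, 137, 300]

r = pmcc(nvr, avg_as)
print(round(r, 2))  # -0.43
```

The function works directly from the sums, so it mirrors the hand calculation step by step rather than using the mean-deviation form.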

Spearman’s rank correlation coefficient
However, if we’re simply interested in how the rankings are correlated, we might discard the original data and use the rankings instead.

11+ NVR Score ($x$)   Rank   Avg AS point score ($y$)   Rank
119                   1      287                        2
103                   3      265                        3
110                   2      137                        4
37                    4      300                        1

Using your STATS mode on the ranks: $r = -0.4$

! Spearman’s rank correlation coefficient $r_s$ is the PMCC calculated after the data has been converted to rankings.

Interpreting $r_s$
$r_s = 1$: Rankings in perfect agreement.
$r_s = -1$: Ranks in reverse order.
$r_s = 0$: No correlation in rankings.

Calculating $r_s$ more easily
! If no tied ranks:
$$r_s = 1 - \frac{6 \Sigma d^2}{n(n^2 - 1)}$$
where $d$ is the difference between the two ranks of each data pair. (If there are tied ranks, calculate the normal PMCC on the ranked data.)

11+ rank ($x$)   AS rank ($y$)   $d$
1                2               -1
3                3               0
2                4               -2
4                1               3

$\Sigma d^2 = 1 + 0 + 4 + 9 = 14$
$$r_s = 1 - \frac{6 \times 14}{4(4^2 - 1)} = -0.4$$
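The ranking step and the shortcut formula can be sketched in Python as follows. The helper names `ranks` and `spearman` are hypothetical, rank 1 is given to the largest value to match the slides, and the sketch assumes there are no tied values (with ties, the PMCC on the ranks should be used instead).

```python
def ranks(values):
    """Rank 1 for the largest value. Assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's rank correlation via 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    rank_x, rank_y = ranks(xs), ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Made-up values from the slides.
nvr = [119, 103, 110, 37]
avg_as = [287, 265, 137, 300]

print(ranks(nvr))      # [1, 3, 2, 4]
print(ranks(avg_as))   # [2, 3, 4, 1]
r_s = spearman(nvr, avg_as)
print(round(r_s, 2))   # -0.4
```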

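Proof of $r_s$ and PMCC equivalence (Not in textbook/exam)

$$S_{xx} = \Sigma x^2 - \frac{(\Sigma x)^2}{n} = \Sigma x^2 - n\bar{x}^2, \qquad S_{xy} = \Sigma (x - \bar{x})(y - \bar{y})$$

Since we know each of the $x$ are 1 to $n$:

$$\bar{x} = \frac{1}{2}(n+1), \qquad \Sigma x^2 = \frac{1}{6} n(n+1)(2n+1)$$

Therefore:

$$S_{xx} = \frac{1}{6} n(n+1)(2n+1) - n\left(\frac{1}{2}(n+1)\right)^2 = \frac{n(n^2-1)}{12}$$

The $y$ are also 1 to $n$, so $\bar{y} = \bar{x}$ and $S_{yy} = S_{xx}$.

$$\begin{aligned}
S_{xy} &= \Sigma (x - \bar{x})(y - \bar{y}) \\
&= \Sigma xy - \Sigma x\bar{y} - \Sigma \bar{x}y + \Sigma \bar{x}\bar{y} \\
&= \Sigma xy - \bar{y}\,\Sigma x - \bar{x}\,\Sigma y + n\bar{x}\bar{y} \\
&= \Sigma xy - n\left(\frac{n+1}{2}\right)^2 \\
&= \dots \\
&= \frac{n(n+1)(n-1)}{12} - \frac{\Sigma (x^2 + y^2)}{2} + \Sigma xy \\
&= \frac{n(n^2-1)}{12} - \frac{\Sigma (x^2 - 2xy + y^2)}{2} \\
&= \frac{n(n^2-1)}{12} - \frac{\Sigma (x - y)^2}{2} \\
&= \frac{n(n^2-1)}{12} - \frac{\Sigma d^2}{2}
\end{aligned}$$

$$r_s = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\dfrac{n(n^2-1)}{12} - \dfrac{\Sigma d^2}{2}}{\sqrt{\left(\dfrac{n(n^2-1)}{12}\right)^2}} = 1 - \frac{6 \Sigma d^2}{n(n^2-1)}$$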

Test Your Understanding
Edexcel S3 June 2011 Q2
Bro Exam Tip: Use your calculator STATS mode to calculate $r$ on your ranked data and check your answer: guaranteed full marks every time!

Differences between $r$ and $r_s$ (Bro Exam Tip: This can be tested!)
[Figure: two scatter plots. Original data: $r < 1$. Ranked data: $r_s = 1$.]
Spearman’s Rank: Makes no assumption about the original data; the original data need not be linear.
PMCC: We can only do a hypothesis test if the variables are (jointly) normally distributed. (We’ll do hypothesis testing in a sec.)

Exercise 5A

Hypothesis Testing
What do you think would be a suitable null hypothesis when analysing the correlation of two variables?
In general, the null hypothesis is that the data is random, i.e. in this case, that there is no linear correlation between the two variables.
Now suppose the two variables were each normally distributed.
[Figure: normal density curves $f(x)$ for English mark $x$ and $f(y)$ for Maths mark $y$.]
See Demo > (File Ref: PMCC_Correlation_Model)

Questions from Demo
Given the points were randomly generated, what do we expect the correlation to be?
0: if the data was randomly generated and the variables were independent, there’s no inherent connection between them.
Is it possible that for some randomly generated independent data, the correlation may be high?
Yes, just by chance they could show either positive or negative correlation.
! $\rho$ (Greek letter “rho”) is a population parameter which is the actual correlation between variables $X$ and $Y$.
! $r$ is the observed correlation from a sample. This varies across samples.
We saw in the demo that when $\rho = 0$, $r$ jumped around symmetrically about $\rho$. This forms an (incredibly complicated) sampling distribution for $r$.
[Figure: sampling distribution $f(r)$ for $r$ between $-1$ and $1$, with a 5% region marked.]
We might be interested in knowing the critical value $c$ for which the probability that $r$ exceeds it is 5%, i.e. the point beyond which any correlation seen is considered significant (under the assumption that any correlation is purely due to chance).

Correlation Coefficient Table
Formula booklet (note $\rho = 0$, i.e. we’re always assuming no correlation in S3).
In our demo our sample size was $n = 10$ and $\rho = 0$. Determine:
The critical region at which we have a significant positive correlation (significance level 5%): $r > 0.5494$
The critical region at which we have a significant correlation (significance level 5%): $r < -0.6319$ or $r > 0.6319$
The critical region at which we have a significant correlation (significance level 1%): $r < -0.7646$ or $r > 0.7646$
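The tabulated one-tailed 5% value for $n = 10$ can be sanity-checked by simulation, in the spirit of the demo. The sketch below is not the demo itself: it assumes independent standard normal variables (so $\rho = 0$), and the function names, seed and number of trials are arbitrary illustrative choices.

```python
import random
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient via S_xx, S_yy and S_xy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)

def simulate_r(n, trials, seed=0):
    """Sorted sample values of r for independent normal x and y (rho = 0)."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        results.append(pmcc(xs, ys))
    return sorted(results)

rs = simulate_r(n=10, trials=20000)

# Empirical value of c with roughly 5% of the simulated r above it.
critical = rs[int(0.95 * len(rs))]
print(round(critical, 3))  # should land close to the tabulated 0.5494
```

Because the result is an estimate from random samples, it will only agree with the table to about two decimal places; more trials tighten the estimate.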

Example Hypothesis Test
The product-moment correlation coefficient between 30 pairs of reactions is $r = -0.45$. Using a 0.05 significance level, test whether or not $\rho$ differs from 0.
Null/Alternative Hypotheses? $H_0: \rho = 0$, $H_1: \rho \neq 0$
Critical Region? $r < -0.3610$ or $r > 0.3610$
Conclusion? $-0.45 < -0.3610$, therefore the value of $r$ is significant. Reject $H_0$ and accept $H_1$. There is evidence of some correlation.

Test Your Understanding
The table shows the BMI (Body Mass Index) of a number of people along with their age.

Age   26   30   31     50   42
BMI   18   21   20.5   24   17

What assumption are we making about the data in order to carry out a hypothesis test on the Product Moment Correlation Coefficient?
That age and BMI are (jointly) normally distributed.
Carry out a suitable hypothesis test at the 5% level of whether age and BMI are correlated.
$H_0: \rho = 0$, $H_1: \rho \neq 0$
$r = 0.455$
$0.455 < 0.8783$, therefore do not reject $H_0$. There is insufficient evidence of correlation between age and BMI.
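The BMI test can be worked through in code as a check on the arithmetic. This is a sketch only: `pmcc` and `is_significant_two_tailed` are hypothetical helper names, and the critical value 0.8783 is simply typed in from the table for $n = 5$ rather than computed.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient via S_xx, S_yy and S_xy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)

def is_significant_two_tailed(r, critical_value):
    """True if H0: rho = 0 is rejected in favour of H1: rho != 0."""
    return abs(r) > critical_value

ages = [26, 30, 31, 50, 42]
bmis = [18, 21, 20.5, 24, 17]

r = pmcc(ages, bmis)
print(round(r, 3))                           # 0.455
print(is_significant_two_tailed(r, 0.8783))  # False: do not reject H0
```

The same helper reproduces the earlier example with 30 pairs: `is_significant_two_tailed(-0.45, 0.3610)` is `True`, so $H_0$ is rejected there.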

Hypothesis Testing with Spearman’s Rank
Why do you think we might use a different table for hypothesis testing with Spearman’s rank?
For the PMCC, the distribution of $p(r)$ was produced by repeatedly sampling from two (jointly) normally distributed variables and taking the PMCC each time, i.e. the variables are assumed to be normally distributed. But with data for which we’re calculating $r_s$, the data in each variable is always 1 to $n$!

Sampling distribution of $r_s$ when the sample size is, say, 2 or 3:

$n = 2$: suppose we ordered the first set of data $x$ so that the ranks are 1, 2.

Possible samples for $Y$   $r_s$
(1,2)                      1
(2,1)                      -1

$r_s$      -1    1
$p(r_s)$   1/2   1/2

$n = 3$: suppose we ordered the first set of data $x$ so that the ranks are 1, 2, 3.

Possible samples for $Y$   $r_s$
(1,2,3)                    1
(1,3,2)                    0.5
(2,1,3)                    0.5
(2,3,1)                    -0.5
(3,1,2)                    -0.5
(3,2,1)                    -1

$r_s$      -1    -0.5   0.5   1
$p(r_s)$   1/6   2/6    2/6   1/6
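For small $n$ the sampling distribution of $r_s$ can be listed exactly by enumerating every ordering of the $y$ ranks, each assumed equally likely under the null hypothesis. The following sketch does this with exact fractions; the function name `rs_distribution` is an illustrative choice.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def rs_distribution(n):
    """Exact distribution of r_s when every ordering of the y ranks is equally likely."""
    x_ranks = range(1, n + 1)
    counts = Counter()
    for y_ranks in permutations(x_ranks):
        sum_d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
        r_s = 1 - Fraction(6 * sum_d2, n * (n * n - 1))
        counts[r_s] += 1
    total = sum(counts.values())
    return {r_s: Fraction(count, total) for r_s, count in sorted(counts.items())}

for n in (2, 3):
    print(n, {str(r_s): str(p) for r_s, p in rs_distribution(n).items()})
# 2 {'-1': '1/2', '1': '1/2'}
# 3 {'-1': '1/6', '-1/2': '1/3', '1/2': '1/3', '1': '1/6'}
```

The number of orderings grows as $n!$, so this brute-force approach is only practical for small samples, which is one reason critical values are read from a table.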

Example ? ?

Test Your Understanding
Edexcel S3 June 2011 Q2
$r_s = 0.5$ (we found this earlier)

YOU HAVE REACHED THE END OF MATHS WELL DONE