Presentation on theme: "Measure your handspan and foot length in cm to nearest mm We will record them as Bivariate data below: Now we need to plot them in what kind of graph?"— Presentation transcript:
Measure your handspan and foot length in cm, to the nearest mm. We will record them as bivariate data below. Now we need to plot them: in what kind of graph? Go on then... accurately! correlation 1.xls
Before adding a line of best fit it is sensible to consider whether there should be one in the first place. At GCSE we just looked at the scattergraph and decided visually whether a correlation existed and whether it was weak or strong. However, this is dangerous. Consider the graphs below and state their correlation purely from a visual point of view. corrrelation 2.xls
Because of this, statisticians use a numerical value to assess whether the correlation is strong enough to justify adding a line of best fit. A popular choice is the Product Moment Correlation Coefficient (PMCC). This is usually just denoted by "r" and is often squared (although squaring means you can no longer tell whether the correlation is positive or negative). It is calculated using the formula below. It is a lot easier on a spreadsheet or graphical calculator, so in exams they often give you some of the "bits". This is the formula we use in practice, but this link explains where it has come from and how it relates to your scattergraph points.
Once you've calculated the PMCC it needs to be interpreted. Open this spreadsheet and use the graph tool to draw a scattergraph for each dataset. Consider which coloured dataset has the strongest correlation. Now add a linear line of best fit and consider how close the points appear to the line. Do you still agree with your previous answers? Calculate the r values for each set of data. Now add the r² value to each graph. Were you right? Square root these values to find the size of r, then decide from the graph whether r is negative or positive.
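The last step above can be sketched in a few lines of Python. Square rooting r² only gives the size of r, so this sketch takes the sign from the gradient of the line of best fit (r and the gradient always share a sign). The function name and the values 0.64 and ±1.5 are invented for illustration.

```python
from math import copysign, sqrt

def r_from_r_squared(r_squared, gradient):
    """Recover r from r^2, taking its sign from the gradient of the line of best fit."""
    return copysign(sqrt(r_squared), gradient)

# Invented example: the same r^2 with an upward-sloping and a downward-sloping line.
print(r_from_r_squared(0.64, 1.5))
print(r_from_r_squared(0.64, -1.5))
```

The two calls give values of equal size and opposite sign, which is exactly the information r² on its own throws away.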
If r is 1 there is perfect positive correlation (the points form a straight line). If r is -1 there is perfect negative correlation. Between -1 and 1 we have strong, weak and no correlation. The closer r is to 1 or -1 the stronger the correlation, and the closer it is to 0 the nearer we are to no correlation between your values. However, the more points you have in your dataset, the further from 1 or -1 r will tend to be, even when the correlation is strong, because it is much easier for a few points to lie near a line by chance. To interpret r correctly we must also consider how many pieces of data were collected. From your earlier datasets, which of the turquoise and orange has the stronger correlation according to the r value?
They have almost the same PMCC value. However, the orange dataset has the stronger correlation, because it is more difficult to get 10 points near to a straight line than 5 points. This weblink gives a table you should refer to when deciding whether a PMCC value is high enough to assume a correlation exists. As long as the size of r is larger than the value in the table you can be....% sure there is a correlation. Consider the yellow and blue datasets. Only one piece of data has changed. What is the probability these datasets show a correlation?
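As a rough sketch of how such a table is used, the Python below compares one r value against the critical value for two sample sizes. The r value of 0.8 is invented, not taken from the turquoise or orange data. The two critical values are the 5% two-tailed figures usually quoted in published PMCC tables for n = 5 and n = 10, so check them against the table you have been given.

```python
# Critical values of r at the 5% significance level (two-tailed), keyed by sample size n.
# Assumed from standard published PMCC tables; check against your own table.
CRITICAL_R_5_PERCENT = {5: 0.8783, 10: 0.6319}

def is_significant(r, n):
    """True if the size of r is larger than the table value for n data points."""
    return abs(r) > CRITICAL_R_5_PERCENT[n]

# Invented example: the same r value with 5 points and with 10 points.
for n in (5, 10):
    print(n, is_significant(0.8, n))
```

The same r passes the test with 10 points but not with 5, which is why the number of data points matters when interpreting r.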
correlation 1.xls Add a line of best fit (by eye) to your graph of the foot length and handspan data. Use what you have learned in C1 to calculate an equation for this line. Is your line the same as any of your classmates'? Why do you think this is? Are you happy you have put your line in the right place? Could you move it and still be happy? What made you put it where you did?
Was your line equation the same as the Excel one? Excel calculates this line mathematically rather than by visual judgement. It calculates the vertical distance between each point and a possible line, adds the squares of these distances together, and then adjusts the line's position to minimise this total. Why do you think it squares the distances?
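One way to explore the squaring question is to try it. The sketch below uses five invented points (not the class data) and two candidate lines: y = 2.2 + 0.6x, which is the least-squares line for these points, and the flat line y = 4 through the mean of y. The function names are made up for this example.

```python
# Invented data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def residuals(a, b):
    """Signed vertical distances from each point to the line y = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def sum_of_squares(a, b):
    """Total of the squared vertical distances, the quantity being minimised."""
    return sum(e * e for e in residuals(a, b))

for name, (a, b) in {"least squares": (2.2, 0.6), "flat line": (4.0, 0.0)}.items():
    print(name, round(sum(residuals(a, b)), 6), round(sum_of_squares(a, b), 6))
```

For both lines the signed distances cancel to zero (up to rounding), so adding them up cannot tell a good line from a poor one. The squared totals differ, and the least-squares line has the smaller one.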
This seems complicated but there are formulae you can use to do it quickly. To begin with we will consider how each coordinate differs from the mean. The formulae can be rewritten into a form that is easier to calculate:
\[ S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} \qquad S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n} \qquad S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n} \]
The three parts are then put together to produce the Product Moment Correlation Coefficient (PMCC), r, from the Excel graphs we considered earlier:
\[ r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} \]
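If you want to check a hand calculation, the sums Sxx, Syy and Sxy and the resulting r = Sxy / √(Sxx Syy) can be sketched in Python as follows. The data and function names are invented for illustration.

```python
from math import sqrt

def summary_sums(xs, ys):
    """Return (Sxx, Syy, Sxy) using the easier-to-calculate forms."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy

def pmcc(xs, ys):
    """Product Moment Correlation Coefficient, r."""
    s_xx, s_yy, s_xy = summary_sums(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)

# Invented data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(summary_sums(xs, ys))
print(round(pmcc(xs, ys), 4))
```

For this data Sxx = 10, Syy = 6 and Sxy = 6, so r = 6 / √60, which is about 0.77.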
The formulae for S_xx etc. can also be used to find the equation of the line of best fit. A straight line is in the form y = a + bx, where b is the gradient, found using
\[ b = \frac{S_{xy}}{S_{xx}} \]
and a is found using
\[ a = \bar{y} - b\,\bar{x} \]
where \(\bar{y}\) is the mean of the y data and \(\bar{x}\) is the mean of the x data. Given the gradient of the line, and knowing it should pass through the point \((\bar{x}, \bar{y})\), a can easily be calculated.
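As a minimal sketch, the calculation of a and b for the line y = a + bx, with b = Sxy / Sxx and the line passing through the mean point, might look like this in Python. The data and the function name are invented for illustration.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimises vertical distances."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = s_xy / s_xx            # gradient
    a = mean_y - b * mean_x    # the line passes through (mean_x, mean_y)
    return a, b

# Invented data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = regression_y_on_x(xs, ys)
print(round(a, 4), round(b, 4))
```

Here the means are 3 and 4, Sxx = 10 and Sxy = 6, so the line is y = 2.2 + 0.6x.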
From our class data, estimate the handspan of a Year 12 SAC student who has a foot length of 30 cm. Estimate the foot length of a student with a handspan of 22 cm. Redraw the data with handspan on the x axis and draw a line of best fit. Is your answer the same? Download the data in Excel and swap the data columns over. What happens to the equation of the line of best fit? Calculate the foot length above using both equations Excel gives you. Comment on your results. correlation 1.xls
You will notice that the formula for b uses S_xx but not S_yy. This is because this formula is only used for a line of best fit required for finding y given a specific x value. It minimises the vertical distance of each point from the line of best fit. If you want to estimate the x value given a specific y value you should use a different line of best fit, x = a' + b'y, which minimises the horizontal distance from each point to the line. The formula for this line of best fit is only very slightly different:
\[ b' = \frac{S_{xy}}{S_{yy}} \]
Use b' and the means of x and y to find a'.
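To see that the two lines of best fit really do give different estimates, here is a small Python sketch. It assumes the second line is written as x = a' + b'y with a' = x̄ − b'ȳ, so that it also passes through the mean point. The data and function names are invented for illustration.

```python
def regression_y_on_x(xs, ys):
    """Return (a, b) for y = a + b*x, minimising vertical distances."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = s_xy / s_xx
    return sum(ys) / n - b * sum(xs) / n, b

def regression_x_on_y(xs, ys):
    """Return (a', b') for x = a' + b'*y, minimising horizontal distances."""
    n = len(xs)
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b_dash = s_xy / s_yy
    return sum(xs) / n - b_dash * sum(ys) / n, b_dash

# Invented data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = regression_y_on_x(xs, ys)
a_dash, b_dash = regression_x_on_y(xs, ys)

y_known = 5
print("x from x-on-y line:", round(a_dash + b_dash * y_known, 3))
print("x from rearranging y-on-x line:", round((y_known - a) / b, 3))
```

The two estimates of x for the same y do not agree, because each line minimises a different set of distances.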
The only time we don't use the x = a' + b'y version for estimating x when we know the y value is when the x data is FIXED. If you collect data from an experiment where one of the variables is pre-set, we call that variable FIXED; we must plot it on the x axis and then use the y = a + bx line of best fit for any estimating of values. An example of this might be timing an ice cube melting at certain temperatures. The temperatures used are decided beforehand (FIXED), so temperature needs to be on the x axis.
A regression line can be used to estimate the value of the dependent variable for any value of the independent variable. Interpolation is when you estimate a value within the range of the data using the equation of the regression line (line of best fit). Extrapolation is when you estimate a value outside the range of the data using the equation of the regression line. What do you think are the dangers of either of these techniques, and which one would you view most cautiously? Why?
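The distinction can be made concrete with a short Python sketch that labels each estimate according to whether the x value lies inside the range of the collected data. The data range, the line y = 2.2 + 0.6x and the function name are all invented for illustration.

```python
def estimate(x_new, xs, a, b):
    """Estimate y from y = a + b*x and say whether x_new is inside the data range."""
    inside = min(xs) <= x_new <= max(xs)
    kind = "interpolation" if inside else "extrapolation"
    return a + b * x_new, kind

# Invented x data and regression line for illustration only.
xs = [1, 2, 3, 4, 5]
a, b = 2.2, 0.6

for x_new in (3.5, 8):
    value, kind = estimate(x_new, xs, a, b)
    print(x_new, round(value, 2), kind)
```

The formula happily produces a number in both cases; only the label tells you that the second estimate relies on the pattern continuing beyond the data you actually have.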