# Statistics for A2 biology: statistics is the branch of mathematics dealing with uncertainty, or probability


Checking for patterns: the villagers. We have two groups of men from separate villages. One group looks taller. Is there a real difference? Could these just be two samples from the same group, with sampling error? What is the chance that both samples could have come from the same population?

Hypotheses: support or reject. In science, we cannot prove that there is a difference between the two villages. We make predictions (hypotheses), carry out experiments and examine the data. We can then say one of two things:

- the hypothesis is supported by the data, or
- the hypothesis is not supported and can therefore be rejected.

When using statistics, we make two hypotheses:

- The null hypothesis, H0, proposes that there is no pattern at all, e.g. the observed difference in heights in the two villages is a result of sampling error. Both villages are part of the same population.
- The alternative hypothesis, H1, proposes that there is a pattern, e.g. the observed difference in heights in the two villages cannot be accounted for by pure chance.

Checking for patterns: the light-seeking fleas. Fleas are tested at various times after hatching, and we record how many move towards the light (positive phototaxis). There seems to be a clear positive correlation. What is the probability that this apparent pattern is a result of pure chance?

Hypotheses for the flea experiment.

- The null hypothesis, H0, proposes that there is no pattern at all: the apparent correlation between time since hatching and percentage of positively phototaxic fleas is the result of pure chance.
- The alternative hypothesis, H1, proposes that there is a pattern: the apparent correlation between time since hatching and percentage of positively phototaxic fleas cannot be accounted for by pure chance.

Checking for patterns: genetic ratios. Gregor Mendel, the founder of genetics, crossed two pea plants with green pods. Both were heterozygous for the recessive characteristic, yellow pods. In the next generation, 428 plants had green pods and 152 had yellow pods. According to the theory, the expected ratio is 3:1. The actual ratio is 2.82:1. Could this deviation from the ideal ratio be produced by pure chance, or is there a significant deviation, so that there is not adequate support for the hypothesis?

Hypotheses for the genetics cross.

- The null hypothesis, H0, proposes that the deviation from the expected 3:1 ratio could have been produced by pure chance.
- The alternative hypothesis, H1, proposes that the deviation from the expected 3:1 ratio could not have been produced by pure chance; in other words, it is not a 3:1 ratio.
- These hypotheses may seem rather strange, as the expected ratio is in the null hypothesis. This time we want support for the null hypothesis, as this means that the interpretation of a 3:1 ratio is correct.
- The golden rule is not broken: null hypotheses always propose that variations are produced by chance.

The villagers. We ask the men to stand in order of height. What do you notice about the ranked men?

The villagers (2).

1) We give each man a number for his rank: 1 1 3 3 5 5 7 8 9 9 9 12 13 16 19 20

2) Where ranks are equal, we give each man the average of the positions occupied by the tied men: 1.5 3.5 5.5 7 8 10 12 14 17 19 20

3) We collect the ranks for the two villages into two groups:

- BLUE VILLAGE: 1.5, 3.5, 5.5, 7, 10, 12, 14, 17
- ORANGE VILLAGE: 5.5, 8, 10, 14, 17, 19, 20

The average ranks are quite different (BLUE 7.55; ORANGE 13.45), but could this happen by chance?
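The tie-averaging in step 2 can be sketched in a few lines of Python. This is only an illustration: the function name `average_ranks` is made up for this example, and the input is the first eleven raw ranks listed in step 1 (1 1 3 3 5 5 7 8 9 9 9), treated simply as values to be ranked.

```python
def average_ranks(values):
    """Rank values from smallest to largest; tied values share the mean of their positions."""
    ordered = sorted(values)
    # Collect the 1-based positions that each distinct value occupies in the sorted order.
    positions = {}
    for position, value in enumerate(ordered, start=1):
        positions.setdefault(value, []).append(position)
    # A tied group gets the average of the positions it occupies.
    mean_rank = {value: sum(p) / len(p) for value, p in positions.items()}
    return [mean_rank[value] for value in values]


# First eleven raw ranks from step 1; ties at 1, 3, 5 and 9.
raw = [1, 1, 3, 3, 5, 5, 7, 8, 9, 9, 9]
print(average_ranks(raw))
# [1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.0, 8.0, 10.0, 10.0, 10.0]
```

The three men tied at rank 9 occupy positions 9, 10 and 11, so each receives (9 + 10 + 11) / 3 = 10, which matches the value shown in step 2.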

The Mann-Whitney test (1). The average rank for BLUE is 7.55 and for ORANGE 13.45. This looks pretty convincing, but to be sure, we have to allow for the size of the samples, as a big difference in ranks gives more certainty when the sample is big. First we calculate the total of the ranks in each group (Σ, the Greek letter sigma, means "the sum of"):

$$R_1 = \sum(\text{ranks for blue village}) = 75.5$$

$$R_2 = \sum(\text{ranks for orange village}) = 134.5$$

Now we calculate two values for Mann-Whitney's U, as follows:

$$U_1 = n_1 n_2 + 0.5\,n_1(n_1 + 1) - R_1$$

$$U_2 = n_1 n_2 + 0.5\,n_2(n_2 + 1) - R_2$$

(n1 and n2 are the numbers in the samples, in this case both 10.)
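The rank sums and both U values can be checked with a short Python sketch. Note the assumption: the slides list only the distinct ranks for each village, so the two ten-value lists below are a reconstruction chosen to match the stated rank sums (75.5 and 134.5) and sample sizes (10 each), not data copied from the slides. The function name `mann_whitney_u` is likewise just for this example.

```python
def mann_whitney_u(ranks_1, ranks_2):
    """Return (U1, U2) from the ranks of two samples, using the formulas above."""
    n1, n2 = len(ranks_1), len(ranks_2)
    r1, r2 = sum(ranks_1), sum(ranks_2)
    u1 = n1 * n2 + 0.5 * n1 * (n1 + 1) - r1
    u2 = n1 * n2 + 0.5 * n2 * (n2 + 1) - r2
    return u1, u2


# ASSUMPTION: full rank lists reconstructed from the distinct ranks and rank sums on the slides.
blue = [1.5, 1.5, 3.5, 3.5, 5.5, 7, 10, 12, 14, 17]
orange = [5.5, 8, 10, 10, 14, 14, 17, 17, 19, 20]

u1, u2 = mann_whitney_u(blue, orange)
print(sum(blue), sum(orange))  # 75.5 134.5
print(u1, u2)                  # 79.5 20.5
print(u1 + u2 == len(blue) * len(orange))  # True
```

A handy sanity check is that U1 + U2 should always equal n1 × n2 (here 100); if it does not, one of the rank sums or the arithmetic has gone wrong.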

The Mann-Whitney test (2). With R1 = 75.5 (blue village), R2 = 134.5 (orange village) and n1 = n2 = 10, the two values for Mann-Whitney's U are:

$$U_1 = n_1 n_2 + 0.5\,n_1(n_1 + 1) - R_1 = (10 \times 10) + 0.5 \times 10 \times (10 + 1) - 75.5 = 100 + 55 - 75.5 = 79.5$$

$$U_2 = n_1 n_2 + 0.5\,n_2(n_2 + 1) - R_2 = (10 \times 10) + 0.5 \times 10 \times (10 + 1) - 134.5 = 100 + 55 - 134.5 = 20.5$$

As a check, U1 + U2 = n1 × n2 = 100. The lower of the two values of U (here 20.5) counts for the next stage: significance testing.

The Mann-Whitney test (3). First, let us return to the hypotheses for this problem:

- The null hypothesis, H0: the observed difference in heights in the two villages is a result of sampling error. Both villages are part of the same population.
- The alternative hypothesis, H1: the observed difference in heights in the two villages cannot be accounted for by pure chance.

We look up the smaller value of U (20.5) in a significance table. This gives us the probability for the null hypothesis. To read the table:

- the heading says that this table is for a probability of 5%, or p = 0.05;
- we look up the critical value for the correct values of n1 and n2;
- if our calculated value for U is less than or equal to the critical value, then the probability of the null hypothesis is less than 5% (p < 0.05).

The Mann-Whitney test (4). The probability (p) that the observed difference in heights in the two villages is a result of sampling error, and that both villages are part of the same population, is given by p < 0.05. We therefore reject the null hypothesis. We can say that there is a significant difference between the heights of men in the two villages, and that the difference is significant at the 5% level. (The result may be significant at a stricter level as well; to find out, we would need the table for that level of significance.)

The advantage of replicates. Note how the critical values increase with bigger samples. As a significant difference requires a calculated U at or below the critical value, doing many replicates makes it easier to demonstrate a significant difference.

Tailedness in the U test: one- and two-tailed tests. In a two-tailed test, your alternative hypothesis simply proposes that there is a difference between the two groups compared. It proposes that the groups are different, but not which group has the higher values. In a one-tailed test, your alternative hypothesis proposes that there is a difference between the two groups compared, with a definite direction (either the first or the second group has the higher values). In a good experiment, the investigator should be able to make a prediction with direction, and one-tailed tests are the rule.

Testing for correlation (1). As with the villagers, the statistical test used for the flea example uses ranking, but in a different way. Check out this graph showing "perfect" correlation. [Graph: "perfect" correlation, with ranks 1 to 6 marked on both axes.] Give each data point a rank on the x axis, and then on the y axis. Now make a table of pairs of rankings:

- x ranks: 1, 2, 3, 4, 5, 6
- y ranks: 1, 2, 3, 4, 5, 6

There is a perfect match.

Testing for correlation (2). Now, check out this graph showing a less than perfect correlation. [Graph: less than perfect correlation, with ranks marked on both axes.] Again, give each data point a rank on the x axis, and then on the y axis. Now make a table of pairs of rankings:

- x ranks: 1, 2, 3, 4, 5=
- y ranks: 1, 3, 2, 5, 5, 6

The rankings no longer match perfectly.

Testing for correlation (3). Now, check out this graph showing no apparent correlation. [Graph: no apparent correlation, with ranks marked on both axes.] Again, give each data point a rank on the x axis, and then on the y axis. Now make a table of pairs of rankings:

- x ranks: 1, 2, 3, 4, 5, 6
- y ranks: 5=, 2, 4, 1, 3

There are a lot of mismatches in the rankings.

Testing for correlation (4): Spearman's rank correlation test (1).

- Spearman's test starts with a table comparing rankings on the x and y axes.
- It gives a single number, called the correlation coefficient. The symbol for this is r.
- r is in the range +1.0 to −1.0.
- When r = +1.0, there is a perfect positive correlation, with the line of best fit going up from bottom left to top right, and the rankings the same on both the x and the y axis.
- When r = −1.0, there is a perfect negative correlation, with the line of best fit going down from top left to bottom right, and the rankings exactly the opposite on the x and the y axis.
- When r = 0, there is no correlation.
- When r is between 0 and either −1 or +1, there is a weaker correlation.

[Example graphs: r = +1.0, r = −1.0, r = 0.0, r = +0.8, r = −0.6]

Spearman's rank correlation test (2): calculating the correlation coefficient, r.

| hrs after hatching | % phototaxic | x rank (r_x) | y rank (r_y) | d (r_x − r_y) | d² |
|---|---|---|---|---|---|
| 5 | 3 | 1 | 1 | 0 | 0 |
| 18 | 5 | 2 | 2 | 0 | 0 |
| 25 | 12 | 3 | 4 | −1 | 1 |
| 31 | 7 | 4 | 3 | 1 | 1 |
| 42 | 17 | 5 | 5 | 0 | 0 |
| 50 | 22 | 6 | 6 | 0 | 0 |
| 88 | 35 | 7 | 7 | 0 | 0 |
| 64 | 37 | 8 | 8.5 | −0.5 | 0.25 |
| 72 | 46 | 9 | 10 | −1 | 1 |
| 80 | 37 | 10 | 8.5 | 1.5 | 2.25 |

n = 10 (number of pairs); Σd² = 5.5 (sum of squared differences)

The steps are: tabulate the data; rank for x; rank for y; calculate the difference between the ranks; square this; add up the squared differences.

Spearman's rank correlation test (3): calculating the correlation coefficient, r (cont.). r is calculated according to this equation:

$$r = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
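As a rough check on the flea table, the coefficient can be computed in Python. This sketch assumes the equation meant here is the usual Spearman formula, r = 1 − 6Σd² / (n(n² − 1)); the function name `spearman_r` is invented for the example, and the ranks are those tabulated for the flea data.

```python
def spearman_r(x_ranks, y_ranks):
    """Spearman's rank correlation from paired ranks, using r = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x_ranks)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(x_ranks, y_ranks))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


# Ranks from the flea table (x = hours after hatching, y = % phototaxic).
x_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_ranks = [1, 2, 4, 3, 5, 6, 7, 8.5, 10, 8.5]

r = spearman_r(x_ranks, y_ranks)
print(round(r, 3))  # 0.967
```

With Σd² = 5.5 and n = 10 this gives r = 1 − 33/990 ≈ 0.967. When there are tied ranks, as here, this shortcut formula is only an approximation, though for these data the difference is very small.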

Significance of the correlation coefficient, r. Consider this graph: [Graph: two data points.] Add a best-fit line. The points are "all" on the best-fit line and the ranks are the same for "all" points, so the correlation coefficient is r = +1.0. But what does this mean? Actually, nothing, because you can always draw a straight line through two points. If the next point comes here (blue dot), then the correlation looks "safer", but if it comes here (green dot), then the correlation looks highly unlikely.

Significance of the correlation test (2). We look up the value for r (+0.967) in a significance table. This gives us the probability for the null hypothesis: that the apparent correlation on the graph could have been obtained by pure chance. First we find the correct line in the table: ν (the Greek letter nu) is the number of data points, in this case 10. Then we see where our calculated value for r fits on the line. In this case it is to the right of the biggest number: the value of r is greater than the critical value for p = 0.01 (two-tailed test) or p = 0.005 (one-tailed test). So the null hypothesis is very unlikely and we have excellent support for the alternative hypothesis.

Significance of the correlation test (3): one- and two-tailed tests. In a two-tailed test, your alternative hypothesis simply proposes that there is some sort of correlation between the variables x and y. It proposes that the variables are linked, but not whether it is a positive or a negative correlation. In a one-tailed test, your alternative hypothesis proposes that there is a correlation between the variables x and y with a definite direction (either positive or negative). In a good experiment, the investigator should be able to make a prediction with direction, and one-tailed tests are the rule.

Significance of the correlation test (4): final conclusion on the flea experiment. We return to the original null hypothesis and give its probability: the probability (p) that the observed positive correlation between the age of the fleas (in hours since hatching) and % positive phototaxis could have been produced by purely random variation is given by p < 0.005 (or p < 0.5%). We then give the "other side of the coin", the support for the alternative hypothesis: the hypothesis that the phototaxic behaviour of fleas is related to the age of the fleas is supported at the 0.5% level.

The statistical advantage of thoroughness. Look down the column for p = 0.01 (two-tailed). With bigger samples, the critical value becomes smaller. This means it is easier to show that there is a correlation. Unfortunately, this means spending more time collecting data, but it is worth it to get a conclusive result.

Problems with correlation: 1, non-linearity. The data in the table relate the number of reptile and amphibian species to the size of an oceanic island.

| area / km² | species count | r_x | r_y |
|---|---|---|---|
| 3 | 5 | 1 | 1 |
| 18 | 7 | 2 | 2 |
| 128 | 9 | 3 | 3 |
| 245 | 13 | 4 | 4 |
| 1130 | 20 | 5 | 6 |
| 1300 | 18 | 6 | 5 |
| 5632 | 50 | 7 | 8 |
| 6213 | 45 | 8 | 7 |
| 45405 | 85 | 9 | 10 |
| 53060 | 78 | 10 | 9 |

The rankings are very similar, and clearly this will be a significant correlation: r = 0.964, with ν = 10, p < 0.001. But look at the graph! This is not a linear relationship; the graph looks right only when both scales are logarithmic.

Problems with correlation: 2, "rogue points". Spreadsheets use statistical techniques to calculate the equation for the "best fit line". But a single "rogue point" (ringed) can distort the line considerably. The true line is more like the one shown in red. It is often better to draw the line yourself. You need to decide which points to ignore, and whether the relationship is linear.

Problems with correlation: 3. Is correlation the same as causation? Consider this graph. Nobody would suggest that an increase in the number of churches, mosques, synagogues and other places of worship causes an increase in the number of public houses. They are both related to a third variable: the size of the community.

Problems with correlation: 4. Is correlation the same as causation?

SMOKING AND LUNG CANCER: Even the tobacco companies cannot deny the correlation between cigarette consumption and risk of lung cancer. But they have brought the idea of causation into question, suggesting a third variable which has nothing to do with smoking, e.g. a certain gene which has two effects: one to increase the risk of cancer, and the other to make a person more likely to take up cigarette smoking.

HIV AND AIDS: A very controversial hypothesis suggested that the presence of particles of the human immunodeficiency virus (HIV) was not the cause of AIDS but just another symptom. The true cause was suggested to be the reckless and irresponsible lifestyle of the patient.

Genetic ratios: are deviations significant? Mendel crossed two pea plants with green pods. Both were heterozygous for the recessive characteristic, yellow pods. In the offspring, 428 plants had green pods and 152 had yellow pods. The expected ratio is 3:1. The actual ratio is 2.82:1. Could this deviation from the ideal ratio be produced by pure chance, or is there a significant deviation? This is a job for the χ² test (chi-squared). This test compares actual numerical patterns with expected patterns and gives the probability that chance could have caused the deviations.

Checking genetic ratios with χ².

- Enter the observed values into the first column of a table (O = observed).
- Calculate the values expected for a "perfect ratio": total offspring = 580; ¾ of this is 435 and ¼ is 145. Enter these values in the second column (E = expected).
- In the third column, calculate the deviations from the expected values (O − E).
- Square these deviations in the fourth column, and in the final column divide by the expected value.
- χ² is the sum of the final column.

| | O | E | O − E | (O − E)² | (O − E)² / E |
|---|---|---|---|---|---|
| green pods | 428 | 435 | −7 | 49 | 0.113 |
| yellow pods | 152 | 145 | 7 | 49 | 0.338 |

χ² = 0.451
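The table can be reproduced with a small Python sketch. The function name `chi_squared` is made up for this example, and the 5% critical value of 3.841 for one degree of freedom is the standard table value rather than a figure quoted in the slides.

```python
def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [428, 152]                       # green pods, yellow pods
total = sum(observed)                       # 580 offspring
expected = [total * 3 / 4, total * 1 / 4]   # 3:1 ratio gives 435 and 145

chi2 = chi_squared(observed, expected)
print(expected)        # [435.0, 145.0]
print(round(chi2, 3))  # 0.451

# ASSUMPTION: standard chi-squared critical value for p = 0.05 with 1 degree of freedom.
CRITICAL_5_PERCENT_DF1 = 3.841
print(chi2 < CRITICAL_5_PERCENT_DF1)  # True: no significant deviation from 3:1
```

Because the calculated value is far below the critical value, the deviation from 3:1 is well within what chance alone could produce.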

Checking genetic ratios with χ²: 2. The significance test. The value of 0.451 for χ² does not mean anything yet. First, we must look up the value in significance tables. As with the Spearman's rank table, there are lines in the χ² table for different values of the number of degrees of freedom. For χ², this is the number of data items minus one, so in this case it is 1. We see where our calculated value fits on this line. It is well below the critical value for p = 0.05, so we give the probability of the null hypothesis as p > 0.05.

Checking genetic ratios with χ²: 3. The significance test: 2. What does this probability mean? Let us return to the null hypothesis:

- The null hypothesis, H0, proposes that the deviation from the expected 3:1 ratio could have been produced by pure chance.

As the probability is greater than 5%, we cannot reject the null hypothesis! At first, this looks like a failure, until we realize that this is just what we want: there is no significant deviation from a 3:1 ratio, so the data support the interpretation that this is a "good" 3:1 ratio, as the null hypothesis proposes.

No tailedness in the χ² test. As the hypotheses for χ² tests predict "fit" or "no fit" and have no direction, there are no one-tailed or two-tailed tests.

Statistics: which test?

| PURPOSE | WHICH TEST? | REQUIRES |
|---|---|---|
| To compare two groups, e.g. heights of trees from different woods, or speed of breakdown of protein by two different enzymes. | The Mann-Whitney U test | At least 6 in each group (in an experiment, 6 replicates!) |
| To check for correlation between two variables, e.g. effect of temperature on metabolic rate. | Spearman's rank correlation test | Different sources give a minimum of between 8 and 15 data points |
| To check for goodness of fit to a numerical pattern, e.g. are woodlice randomly distributed in a choice chamber? | The χ² test | 2 numbers |
