
1 Hypothesis testing Dr David Field
Today’s lecture is a mixture of statistics and issues of scientific research design that you need to be familiar with in order to use the statistics that I will talk about next week.

2 Summary Null hypothesis and alternative hypothesis
Statistical significance (p-value, alpha level) One tailed and two tailed predictions What is a true experiment? random allocation to conditions Outcomes of experiments Type I and Type II error Interpreting 95% confidence intervals: are two samples from the same population? Today’s lecture covers topics which you need to know about before you start using statistics for real in your own research. As such it does not contain any mathematical calculation of statistics, but it lays the foundations for the statistics covered next week.

3 Comparing two samples Lectures 1 and 2, and workshop 1, focused on describing a single variable, which was a sample from a single population. Today’s lecture will consider what happens when you have two variables (samples). The researcher usually wants to ask whether the two samples come from the same population or from two different populations. We’ll also consider examples where there is a single sample, but two variables have been measured to assess the relationship between them.

4 Maths exam performance
Quite often you will see data like this presented on the news. There are two samples, boys and girls in this case, and the means of some dependent variable, such as scores on a maths test, are compared. The conclusion on the news is usually that girls are now better than boys at maths. By doing this the news article is presenting point estimates of population means without any interval estimates, so we don’t really have a good idea where the population means might lie. It is statistically very unsophisticated to compare means without assessing the sampling error, and without asking whether the differences you are observing are likely to be real or due to sampling error.

5 Maths exam performance
Having only had two lectures and a workshop, you are in a position to report data in a way that is more sophisticated than most news programs. You know how to produce 95% confidence intervals around the two sample means, and you can plot these on the graph. If the confidence intervals on the graph looked like this, it would indicate that, given the sample mean and the variation in the sample for boys, there is quite a range of values within which the population mean could fall. There is a similar range for girls. Importantly, the two ranges overlap, so it is entirely possible, given the samples, that the two population means are the same, or very close, or even that the population of boys actually has a higher maths performance than that of girls. Adding confidence intervals to this graph makes us more cautious about the interpretation of the pattern of the means than we might be if we saw the graph without the confidence intervals. We would not want to say for sure that the two populations differ in maths ability based on the graph with error bars. There is a suggestion that this might be the case, but we’d need to perform an inferential statistical test to be sure, which will be covered in the next lecture. So, given a graph like this with error bars, the correct conclusion is just that we can’t be sure whether the two samples reflect two underlying populations of maths ability or just one.

6 Maths exam performance
If we added confidence intervals to the graph and they looked like this (not overlapping on the y axis) then we could be very confident that the samples of boys and girls represent two different underlying populations of people in terms of maths ability.

7 Interpreting confidence intervals on graphs
If the 95% confidence intervals for two means do not overlap, then we treat the difference between the means as real (reliable / significant / the null hypothesis can be rejected). These terms will be explained shortly. If the 95% confidence intervals around two means do overlap, there might be a real difference, but the graph does not itself establish this. To decide, an inferential statistical test is required (t test lecture). Warning: some journal articles plot 1 SE rather than 95% confidence intervals on graphs. Watch out for this, as 1 SE is effectively a 68% confidence interval rather than a 95% confidence interval. In this case the rule at the top of this slide does not apply.
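As a rough illustration of the rule above, here is a minimal Python sketch. The two samples and the ci95 helper are hypothetical, invented purely for illustration; the sketch computes a 95% confidence interval around each sample mean and then checks whether the intervals overlap:

```python
import numpy as np
from scipy import stats

def ci95(sample):
    """95% confidence interval for a sample mean, via the t distribution."""
    sample = np.asarray(sample, dtype=float)
    mean = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(len(sample))
    t_crit = stats.t.ppf(0.975, df=len(sample) - 1)
    return mean - t_crit * se, mean + t_crit * se

boys  = [55, 61, 48, 70, 63, 52, 58, 66, 49, 60]   # hypothetical maths marks (%)
girls = [62, 68, 57, 74, 66, 59, 71, 64, 69, 61]

lo_b, hi_b = ci95(boys)
lo_g, hi_g = ci95(girls)
print(f"boys:  {lo_b:.1f} to {hi_b:.1f}")
print(f"girls: {lo_g:.1f} to {hi_g:.1f}")

# No overlap -> treat the difference as real; overlap -> an inferential
# test (the t test, next lecture) is needed before drawing a conclusion.
overlap = not (hi_b < lo_g or hi_g < lo_b)
print("intervals overlap:", overlap)
```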

8 Hypothesis and null hypothesis
1.7.5 Imagine some researchers have a theory that eating fruit and vegetables improves brain function. They hypothesize that people who eat more fruit and vegetables will perform better in exams. The null hypothesis is that there will be no relationship at all between fruit and vegetable consumption and exam performance. The null hypothesis is required in order to set up statistical tests that can find support for the hypothesis. The null hypothesis is very exact: it means exactly no relationship. This exact property allows the null hypothesis to be used to set up an imaginary “null distribution” for statistical purposes. The hypothesis itself is often referred to as the “alternative hypothesis”, because if you can show that the null hypothesis is false then this is evidence for the alternative. Because 95% confidence intervals only have a consistent interpretation when they do not overlap each other, we need to move towards some more direct and specific way of making a decision about whether two samples, given their respective means, have been drawn from the same or two different populations. Notice that the hypothesis (bullet 2) is more specific than the theory (bullet 1), which is more general. Theories should be more general than the hypotheses they generate.

9 Imagine the researchers test their hypothesis by sampling 12 students
This graph is a scatterplot. In the sample, exam performance increases as fruit & vegetable consumption increases. In the frequency histograms you are used to looking at, the variable of interest is on the x axis and the y axis plots frequency. In a scatter plot there are two variables of interest, and one is plotted against the other. To produce a scatter plot you need to have two variables, both measured in the same sample of people. Here we are plotting grams of fruit against exam marks in percent (see section 4.8.1, Scatterplot).
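A minimal matplotlib sketch of this kind of scatter plot, using made-up numbers for the 12 students (none of these values come from the lecture):

```python
import matplotlib.pyplot as plt

# Hypothetical values for 12 students: fruit/veg intake (grams) vs exam mark (%)
fruit_veg = [120, 200, 150, 310, 280, 90, 400, 350, 180, 260, 330, 210]
exam_mark = [48, 55, 52, 68, 63, 45, 74, 70, 57, 61, 69, 58]

plt.scatter(fruit_veg, exam_mark)            # one point per student
plt.xlabel("Fruit and vegetable consumption (grams)")
plt.ylabel("Exam performance (%)")
plt.title("Sample of 12 students")
plt.show()
```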

10 Visually, the evidence in the previous slide is strong, but it is based upon a small sample of 12 individuals. We need a way of quantifying our confidence that the pattern in the sample is a true reflection of the pattern in the population.

11 null population distribution
[Figure: a random sample (top), the null population distribution (left), and a possible alternative population distribution (right)]

In order to quantify our confidence that the pattern in the sample is true of the population, we first assume that the sample is randomly selected from the population. Visually, it seems very plausible that the sample at the top came from the population on the right (an alternative hypothesis population). Visually, it seems very unlikely that random selection of cases from the population on the left would produce the sample at the top of the slide. However, if you consider what might happen if you repeatedly drew small samples at random from the population on the left, you should be able to visualise the fact that the random sampling process would occasionally throw up a sample pattern like the one at the top of the slide, even if the true population had no relationship. Clearly, seeing a relationship in a sample when there is not really a relationship in the population is more likely to occur when samples are small, which is often the position we find ourselves in when conducting psychological research. (This argument should remind you of the arguments about sampling distributions I made when we were trying to set up a confidence interval around a sample mean value. But now we are thinking about the relationships between two variables in a sample that might be seen by chance if there was no real relationship.)

The population distribution on the left is important and is given a name in statistics: the null distribution. The null distribution is implied by the null hypothesis. The null distribution is a theoretical entity. It does not actually exist, and we can’t measure it. What we can do is calculate the probability of the observed data occurring if the null population distribution were true. This might seem like an odd thing to want to do. Are we not more interested in the probability of the distribution on the right being the true population distribution, rather than knowing the probability that the null distribution is the correct description of the population? Well, yes we are, in principle. However, with the null distribution you can know exactly what its properties are, whereas the alternative population distribution that the researchers are hoping to find evidence for can only very rarely (perhaps never) be quantified exactly in the way the null distribution can. It is this lack of exactness that prevents inferential statistics from directly calculating the probability that the hypothesis is true.

As an interesting aside, note that in the population on the right not all the variation in exam scores is determined by fruit and veg consumption. If you take a particular value of fruit and veg consumption on the x axis, then you can find a number of different exam grades. In this simplistic example it is drawn that way to reflect the fact that exam grades are determined by many factors, not just fruit and veg consumption. That extra variation in the underlying population scores that is not due to the hypothesis under examination is one of the reasons why you can’t easily generate an exact alternative hypothesis population distribution.
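To make the point that random sampling “occasionally throws up” a convincing-looking pattern, here is a small, purely illustrative simulation. It assumes a null population in which the two variables are standard normal and completely unrelated, draws many samples of N = 12, and counts how often a strong positive correlation appears by chance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n = 10_000, 12
strong = 0

for _ in range(n_samples):
    x = rng.normal(size=n)           # fruit/veg intake under the null
    y = rng.normal(size=n)           # exam marks, generated independently of x
    r = np.corrcoef(x, y)[0, 1]      # sample correlation coefficient
    if r >= 0.5:
        strong += 1

# With N = 12, roughly 5% of purely random samples show r >= 0.5
print(f"proportion of null samples with r >= 0.5: {strong / n_samples:.3f}")
```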

12 Null hypothesis testing
In an ideal world, we’d directly estimate the probability that the population conforms to the alternative hypothesis, given the sample: “We are 95% certain that there is a positive relationship between eating fruit and vegetables and exam performance.” This is not possible using classical statistics, because there is an infinite number of possible alternative hypothesis population distributions. But there is only 1 null population distribution, which makes it possible to calculate the probability that the data could be a random sample from it. Note that in 2nd year you will learn about “power”, and when combined with effect size, power analysis can allow something like a direct test of the alternative hypothesis to be made. But first, you have to learn the classical method of null hypothesis testing.

13 Null hypothesis testing
If the probability that the data could be a random sample from the null distribution is less than 5% (1 in 20), you can reject the null hypothesis as false. This indirectly supports the alternative hypothesis, which is never directly tested. If the probability that the data could be a random sample from the null distribution is greater than 5% (1 in 20), you fail to reject the null hypothesis. Failing to reject the null hypothesis is not the same as saying that the null hypothesis is true: statistics never allow you to say that the null hypothesis is true.

14 Jane Superbrain 2.5 Why 0.05 (5%, or 1 in 20)?
This is somewhat arbitrary. 0.05 is called the alpha level; sometimes 0.01 is used instead. 0.05 does produce a good balance between the probability of a researcher making a Type I error and the probability of making a Type II error (see later for the meaning of these types of error). What you need to understand about probability values (p values) is that p = 1 = 100% = certainty; p = 0.1 = 10% = 1 in 10; p = 0.05 = 5% = 1 in 20; p = 0.01 = 1% = 1 in 100.

15 [Figure: sample (N = 12) alongside the null population distribution] How can we quantify the probability of obtaining a sample like the one we have from the null distribution? Using sampling distributions. Imagine drawing a very large number of samples, each with N = 12, from the null distribution. A few of them would look very similar to the actual sample from the real population. Perhaps 1% of the samples would look like the top left graph; therefore, the p value of the data would be 1%.

16 But what does similar mean?
[Figure: sample (N = 12) alongside the null population distribution] Very few random samples from the null distribution would be exactly the same as the sample obtained from the real population. In this example, what defines the null distribution is that there is no relationship at all between exam performance and fruit consumption. A statistic called the correlation coefficient, which quantifies the strength of the relationship between two variables, can be calculated. It has a value of 0 for the null distribution.

17 But what does similar mean?
[Figure: sample (N = 12) alongside the null population distribution] For each sample the correlation statistic can be calculated. Two samples can both have a correlation of 0.5 between exam performance and fruit consumption without being identical to each other. Therefore, the null distribution is defined in terms of values of statistics (like the correlation coefficient). If the obtained sample has a correlation of 0.5, you can calculate the p of a single sample from the null distribution having a correlation of 0.5 or higher. If p < 0.05 you would reject the null hypothesis. Details of statistics that can be converted to p values are covered in later lectures.
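In practice, a statistics library computes both the correlation coefficient and its p value. A minimal sketch using scipy, reusing the hypothetical 12-student numbers from the scatterplot above:

```python
from scipy import stats

# Hypothetical 12-student sample (same illustrative data as before)
fruit_veg = [120, 200, 150, 310, 280, 90, 400, 350, 180, 260, 330, 210]
exam_mark = [48, 55, 52, 68, 63, 45, 74, 70, 57, 61, 69, 58]

r, p = stats.pearsonr(fruit_veg, exam_mark)   # p here is two tailed
print(f"r = {r:.2f}, p = {p:.4f}")

if p < 0.05:
    print("reject the null hypothesis")
else:
    print("fail to reject the null hypothesis")
```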

18 One tailed and two tailed hypotheses
2.6.2 In the example, the researchers predicted that exam performance would improve as fruit and vegetable consumption increases. This is a one directional hypothesis (one tailed). Another group of researchers, funded by a junk food manufacturer, might predict the opposite. It is also possible to predict that one variable will influence another without specifying a direction, e.g., people who eat a lot of fruit and vegetables will perform differently in exams than people who eat a small amount of fruit and vegetables. This is a two tailed hypothesis.

19 One tailed and two tailed hypotheses
Where do the names “one tailed” and “two tailed” come from?

20 The statistics we use to measure how different the sample we have is from the null hypothesis of 0 difference (or no relationship) tend to have probability distributions a bit like the standard normal distribution. These are the distributions that occur assuming the null hypothesis is true and that any differences individual samples have from the null hypothesis arise only due to sampling error. Larger values of the statistic are less likely to happen purely due to sampling error, which is why they are rarer and lie in the tails of the distribution. When you make a 2 tailed prediction you are concerned with both tails of the probability distribution, coloured in black here: you are concerned with the probability that the difference between the sample data and the null hypothesis could occur by random sampling from the null distribution, and you don’t care whether that is a positive difference above the null hypothesis of zero or a negative difference below it. It is because of this concern with the tails of the probability distribution that we use the words one tailed and two tailed hypothesis.

21 When we make a one tailed prediction we are only concerned with a single tail of the distribution. This means we can use a bigger part of that single tail to reach the overall 5% probability of obtaining the data due to random sampling from the null distribution.

22 One tailed and two tailed hypotheses
This slide and the next one should be referred to when writing lab reports. SPSS, the program you will use for statistical analysis, always reports two tailed significance levels. If you have a one tailed hypothesis you can divide the significance value SPSS gives you in half: p = 0.08 becomes p = 0.04. It is important to do this, as many results with small samples will be significant on a one tailed test but not a two tailed test: 0.08 > 0.05 (fail to reject null), whereas 0.04 < 0.05 (reject null). But do not divide the value of the statistic (e.g., “t” or “r”) SPSS reports in half.
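A minimal sketch of this rule, using the slide’s example values (the t statistic shown is a made-up placeholder; halving the two tailed p also assumes the sample effect is in the predicted direction):

```python
p_two_tailed = 0.08      # significance value as reported by SPSS
t_statistic = 1.80       # hypothetical statistic; never halve this

p_one_tailed = p_two_tailed / 2
print(p_one_tailed)      # 0.04 < 0.05, so reject the null (one tailed)
```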

23 Reporting statistical significance
1) Is the p-value > 0.05? Remember to divide by 2 first if one tailed. If the answer to 1) above is “yes” then you can write “t(29) = 1.2, NS”. You will learn where the 29 and the 1.2 come from in the t test lecture. NS stands for “non significant”. If the answer to 1) above is “no” then you can write something like “t(29) = 4.3, p = 0.03”, where the value of p written here is the same number you tested in 1) above. In this case you are reporting a statistically significant result. This way of reporting is called “reporting exact p values”. Reporting exact p values is good because it communicates to the reader the exact probability of your data being obtained under the null hypothesis. The old convention was to just say that p was less than 0.05, or 0.01, e.g. t(29) = 4.3, p < 0.05. You will still see this in journals, but it is really a hangover from the days before computers, when people had to use statistical lookup tables in books after they had computed inferential statistics by hand. You may also see non significant results reported as p > 0.05 instead of NS in journals. However, I think that people get confused between the greater than and less than signs (< and >), so in your lab reports the rule is that you will avoid their use entirely and stick to NS, or an exact p value for a statistically significant result.
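A hypothetical helper, written only to illustrate this module’s reporting rule (the function name and the example p values are illustrative):

```python
def report_t(t, df, p, one_tailed=False):
    """Format a t test result: exact p if significant, otherwise NS."""
    if one_tailed:
        p = p / 2                        # halve the p value, not the statistic
    if p > 0.05:
        return f"t({df}) = {t}, NS"
    return f"t({df}) = {t}, p = {p:.2f}"

print(report_t(1.2, 29, 0.24))           # -> t(29) = 1.2, NS
print(report_t(4.3, 29, 0.03))           # -> t(29) = 4.3, p = 0.03
```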

24 Meaning of “statistical significance”
Jane Superbrain 2.6 “Significant” does not mean “important”. Significance is just the probability of obtaining a result as extreme or more extreme than the sample data you have, assuming the underlying population conforms to the null distribution, in which the mean difference is zero. Recall from lecture 2 and workshop 1 that the SE and 95% confidence interval around a sample mean reduce as sample size increases. Large samples will have a very small SE, making it very easy to achieve p of the null hypothesis < 0.05. With a large sample, if the null hypothesis were true, then producing a result even slightly different from the null hypothesis by random sampling is very unlikely. For example, suppose you sample maths scores for 1000 boys and 1000 girls; the null hypothesis is that boys and girls score the same. If you get a mean of 68% for girls and 67% for boys, this small difference can easily reach statistical significance with N = 2000. But such a difference is not important. In the lecture on t tests I’ll introduce the idea of how to calculate “effect size”, which is a way of addressing how large and important the difference or relationship you are observing is, independently of its significance level.
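A quick, purely illustrative simulation of this point; only the means (68 and 67) and the group sizes come from the slide, while the SD of 10 is an assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
girls = rng.normal(loc=68, scale=10, size=1000)   # assumed SD of 10
boys  = rng.normal(loc=67, scale=10, size=1000)

t, p = stats.ttest_ind(girls, boys)
print(f"t = {t:.2f}, p = {p:.4f}")   # often p < 0.05, despite a trivial ~1% gap
```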

25 Experiments In the fruit and vegetables example the researchers randomly sampled some students and measured two variables; the relationship was plotted. Next term you will learn how to calculate “correlations” to statistically describe this kind of data. In the boys and girls maths performance example the researchers compared a random sample of boys with a random sample of girls; they were looking for a difference between groups. Neither of these research designs constitutes a true experiment.

26 Experiments In the fruit and vegetables example exam performance increased as a function of fruit and vegetable intake, but the researchers did not manipulate the amount of fruit and vegetables eaten by participants. Perhaps fruit and vegetable intake increases as exercise increases, and perhaps exam performance also increases as exercise increases: exercise is a third variable that might potentially cause the changes in the other two. You can’t infer causality by observing a relationship (correlation) between two variables. To turn this fruit and vegetable study into an experiment it would be necessary for the researchers to decide how much fruit and veg each of the participants must consume each week. If you just observe the values of variables in the world, you don’t know what other variables they are related to or caused by.

27 Experiments The boys versus girls case seems more clear cut, but in reality this is still a correlational design. The researchers were not able to decide whether each participant would be male or female; they just come that way. This opens up the possibility that the male / female dichotomy might be correlated with a third variable that is the true explanation of the difference in maths between boys and girls. For example, perhaps shorter people are better at maths, and girls are shorter than boys on average. Height is a confounding variable because it can potentially provide an alternative explanation of the data that competes with the researchers’ hypothesis. This is an implausible example, but it makes the point that the researcher is not really in control of the experiment when comparing groups that are predefined, such as boys vs girls or old vs young. If you find that girls are better than boys at maths tests you can’t logically rule out the possibility that the real cause of the difference is that shorter people are better at maths. To rule it out you’d need to do two more studies, one with boys and one with girls, to rule out the alternative explanation of the data offered by the confounding variable (height). Remember, the scientific method is founded on using (extreme) scepticism to test theories to destruction. Fortunately, in a true experiment, where the researchers manipulate the independent variable (IV) and use random allocation of participants to levels of the IV, most confounds are immediately ruled out and the results have a clear interpretation in terms of the IV causing changes in the dependent variable (DV).

28 Experiments In a true experiment, the researchers can manipulate the variables, e.g. they decide how much fruit and vegetables each participant eats. Things that researchers can manipulate are called independent variables (IV). The thing that is measured, because it is hypothesized that the independent variable has a causal influence on it, is called the dependent variable (DV). In a true experiment participants are almost always randomly allocated to conditions.

29 Random allocation 1.6.3 Researchers think that supplementing diet with 200 g of blueberries per day will improve exam performance compared to equivalent calories consumed as sugar cubes. But exam performance will also be influenced by other factors, such as IQ and number of hours spent studying. If each participant in the total sample is randomly allocated to blueberries or sugar cubes, then with enough participants the mean IQ in the two samples, and the mean number of hours studied, will turn out to be about equal, because these two variables are “equalised” across the two levels of the IV by randomization. They will not contribute to any difference in mean exam scores between the sugar cube and blueberry groups. Random allocation to conditions even protects the researcher against the influence of confounds he/she has not thought of! If the blueberry group have higher exam scores than the sugar cube group, the difference must be caused by the IV. To use Andy Field’s terminology, the IV produces systematic variation between the two experimental groups, and all other sources of variation are called unsystematic variation. What randomization does is to ensure that things like IQ are not systematically related to the IV, so that they can only contribute to the unsystematic variation.
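A minimal sketch of random allocation, assuming a list of 40 hypothetical participant IDs and equal group sizes:

```python
import random

random.seed(42)                        # fixed seed, just for a reproducible example
participants = list(range(1, 41))      # 40 hypothetical participant IDs
random.shuffle(participants)           # randomize the order

blueberries = participants[:20]        # first half -> blueberry condition
sugar_cubes = participants[20:]        # second half -> sugar cube condition

# Randomization means IQ, study hours, and even confounds nobody thought of
# are, on average, equalised across the two levels of the IV.
print("blueberries:", blueberries)
print("sugar cubes:", sugar_cubes)
```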

30 Random allocation IQ and number of hours studied will not influence the mean exam score in the blueberry group or the sugar cube group. We will be able to plot two frequency histograms of the exam scores, one for each group, and calculate the SD. Can IQ scores and hours spent studying influence the SD of the scores in the two groups? Imagine we run the experiment once using a sample containing great variation in IQ and study hours. Imagine we run the experiment again using a sample selected so that the IQs only vary between 100 and 110, and everyone has similar studying habits. What implication could this have for the ability of the experiment to produce a statistically significant result? IQ and hours studied will both contribute to the variation in exam scores in both experimental groups. This means that they both have the effect of increasing the SD (which will increase what Andy Field calls unsystematic variation). Because the SE depends on sample size, but also on the SD, it means the SE will be bigger too. With a bigger SE it is harder to find a statistically significant difference between conditions. You can imagine that the confidence intervals on a graph are more likely to overlap between the two conditions if the SE is bigger. So, in the experiment where we select people to have a narrow range of IQ and similar studying habits, and then randomly allocate them to the blueberry and sugar cube conditions, you are going to get a smaller SD of exam scores in both experimental groups. This means you can find a statistically significant effect of blueberries using fewer participants. Sometimes it helps to try and minimise the unsystematic variation in the experiment. Randomization prevents confounding variables from influencing the means, but they still contribute to variation in scores, which we have to designate as measurement error in our statistics. Big measurement error is bad news because it makes it hard to find statistically significant differences.
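A small, purely illustrative simulation of this idea: exam scores are modelled as a baseline plus variation due to IQ plus other unsystematic variation (all numbers are assumptions), and widening the IQ spread inflates both the SD and the SE:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 29

def exam_scores(iq_sd):
    """Exam score = baseline + variation due to IQ + other unsystematic variation."""
    iq_effect = rng.normal(0, iq_sd, size=n)
    other = rng.normal(0, 5, size=n)
    return 55 + iq_effect + other

for iq_sd in (10, 2):                  # varied sample vs narrow-IQ sample
    scores = exam_scores(iq_sd)
    sd = scores.std(ddof=1)
    print(f"IQ spread {iq_sd}: exam SD = {sd:.1f}, SE = {sd / np.sqrt(n):.2f}")
```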

31 Blueberries: mean 57.3%, SD 10.6%, N = 29. Sugar cubes: mean 51.4%, SD 12.3%, N = 29.
This is what the results of the experiment might look like with 29 participants in each group. We can see that the blueberry group have done better in their exams. However, there is a lot of “UNSYSTEMATIC” variation in the data, as can be seen from the fact that the SD is a large proportion of the mean for both groups. I have calculated 95% confidence intervals for these two groups, and they OVERLAP, and therefore we could not use them as an argument to reject the null hypothesis. We’d need to perform an inferential statistical test called a between groups t test to tell us the probability of obtaining a difference of 6% between the two groups if the two groups were really drawn from the same null distribution. That t test also takes into account the large amount of variation within each group, as well as the difference in the mean exam scores between groups. Note, if you do run a t test on this it is NS.
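Because the slide gives means, SDs, and group sizes, the between groups t test it mentions can be sketched directly from those summary statistics with scipy (assuming equal variances, scipy’s default):

```python
from scipy import stats

t, p = stats.ttest_ind_from_stats(mean1=57.3, std1=10.6, nobs1=29,
                                  mean2=51.4, std2=12.3, nobs2=29)
print(f"t = {t:.2f}, p = {p:.3f}")     # p comes out just above 0.05, i.e. NS
```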

32 The null hypothesis in a true experiment
Begin with a single sample from a single population. Randomly divide the sample between the two levels of the IV. You now have two samples. The hypothesis is that the IV is successful in causing the two samples to come from two statistically separate populations (e.g. of exam scores). The null hypothesis is that the two samples remain as samples from a single population (e.g. of exam scores).

33 Outcomes of Experiments
2.6.3

                                     reality in population
experiment outcome                   null is false      null is “true”
reject null (p < 0.05)
fail to reject null (p > 0.05, NS)

By convention we reject the null if the p value of the data, assuming the null distribution, is less than 5%; otherwise we fail to reject it. It is the existence of this cut off value of 5% that implies the existence of this 2 by 2 grid. NS stands for non significant.

34 Outcomes of Experiments
                                     reality in population
experiment outcome                   null is false      null is “true”
reject null                          true positive
fail to reject null                                     true negative

35 Outcomes of Experiments
                                     reality in population
experiment outcome                   null is false      null is “true”
reject null                          true positive      false positive (Type I error)
fail to reject null                                     true negative

36 Outcomes of Experiments
                                     reality in population
experiment outcome                   null is false      null is “true”
reject null                          true positive      P value of experiment IS the
                                                        probability of a Type I error
fail to reject null                                     true negative

Note: the P value of the experiment is also called the significance level, or alpha.

37 Outcomes of Experiments
                                     reality in population
experiment outcome                   null is false                     null is “true”
reject null                          true positive                     P value of statistic is the
                                                                       probability of a Type I error
fail to reject null                  false negative (Type II error)    true negative

Type II error is sometimes referred to as beta, and you may see the Greek letter β (“beta”).

38 Outcomes of Experiments
                                     reality in population
experiment outcome                   null is false                     null is “true”
reject null                          true positive                     P value of statistic is the
                                                                       probability of a Type I error
fail to reject null                  probability of Type II error      true negative
                                     cannot be assessed

With a single experiment that fails to reject the null hypothesis, you cannot assess the probability that the null hypothesis is really false and you have incorrectly failed to reject it. The only way to really assess the probability of a Type II error is to replicate an experiment many times and work out the reliability. You will study this in year 2.

39 Relationship between Type I and Type II error
Conventionally, we use 0.05 as a threshold (or cut off, or criterion) to decide whether we reject the null hypothesis or not. A researcher can use a more conservative, stricter, threshold, such as 0.01 (1%): this reduces the chance of a researcher publishing a Type I error, but it increases the chance of a Type II error. The only way to find out if an experimental result is a Type I error is to replicate (repeat) it: the p of two consecutive Type I errors is 0.05 * 0.05 = 0.0025. One reason that the 0.05 alpha level is conventionally adopted is because it produces a good compromise between the probability of a Type I error and the probability of a Type II error occurring.

