PALA Summer School 2014 Inferential Statistics Willie van Peer Ludwig-Maximilians-University Munich
Inferential statistics Also ‘test statistics’ sample ---> population? Tests whether observed results in sample may be generalized to the population. Not as ‘yes’ or ‘no’, but as a probability. Statistics is a discipline in which such probabilities are investigated.
Sample vs. Population Suppose you wish to investigate whether ‘free’ or ‘guided’ reading lessons in school yield different pedagogical results. You will have to make observations, ask questions, in one word: collect data. But impossible to ask ALL pupils in your country. So you make a SELECTION: a sample. But you are not interested in this sample only: You want to go beyond the sample: to the population (here in the statistical sense!)
This is a generalization Beyond the sample. But this can be tricky / dangerous / fatal (!) Gunners in WWII bombers mostly said the attacks came from behind and above. Can one generalize these answers? In order to be able to, the sample has to be representative of the population. Is that condition fulfilled here? Of course it is NOT. Think why!
Suppose you have followed the two instruction methods for reading (‘free’ or ‘guided’) in 4 Maribor schools for 2 months. Your data suggest that the ‘guided’ method yields superior pedagogic effects for boys. Are you able to generalize your findings to: –All boys in Maribor schools? (Most probably) –All pupils in Maribor schools? (Certainly not) –All Slovene boys? (Difficult, maybe) –All Slovene pupils? (Definitely not) –All pupils? (no way)
A paradox The paradox of sampling: you need to know what you are in fact trying to find out… If your sample is not representative, its data will be misleading, but how do you know whether it is representative? Another serious problem: self-selection of participants! To avoid sampling problems when asking people in the street: use random numbers, or accost every 4th or 5th person who walks by.
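Both street-sampling tactics mentioned here can be sketched in a few lines of Python. This is only an illustration: the function names and the list of 100 numbered passers-by are invented for the example, not taken from the slides.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct members; every member is equally likely to be picked."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def systematic_sample(population, step, seed=None):
    """Take every step-th member, starting from a randomly chosen offset."""
    rng = random.Random(seed)
    start = rng.randrange(step)
    return population[start::step]

# Hypothetical data: 100 people, numbered in the order in which they walk by.
passers_by = list(range(1, 101))

print(simple_random_sample(passers_by, 10, seed=1))  # 10 people chosen by random numbers
print(systematic_sample(passers_by, 5, seed=1))      # every 5th person
```

Neither tactic removes self-selection: people can still refuse to answer.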
Errors 2 types: –constant errors (E-group in Maribor, C-group in Ljubljana!) –random errors (the weather, the time of the day/year, the general mood in the country, …) Constant errors must be kept under control at all costs, e.g. through randomization. This does not eliminate errors, but turns them into random errors. Random errors cannot be avoided! When we nevertheless find an effect in the E-group: ‘robust’ effect!
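Randomization in this sense can be as simple as shuffling the list of participants before splitting it into an E-group and a C-group. A minimal sketch, with an invented function name and made-up participant labels:

```python
import random

def randomize_groups(participants, seed=None):
    """Shuffle the participants, then split them into an E-group and a C-group."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical pupils P1..P20; chance alone decides who ends up in which group.
pupils = [f"P{i}" for i in range(1, 21)]
e_group, c_group = randomize_groups(pupils, seed=2014)
print("E-group:", e_group)
print("C-group:", c_group)
```

Because chance decides the assignment, any constant difference between pupils (school, city, teacher) is spread over both groups instead of piling up in one of them.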
Hence We must estimate how great the probability is that the effect came about through random errors. I.e.: how probable is it that only the unavoidable random errors created the observed effect of the IV? When this is not particularly probable, we decide that the IV had an effect on the DV. But when do we judge something ‘not particularly probable’?
An example We wish to know whether reading a story with a sad ending is judged more rewarding than reading a story with a ‘happy end’. Imagine we asked 8 people, 7 of whom said they preferred the sad version, and only 1 the happy version. How probable is such a result? To investigate this, let us start from the fact that every informant had two possible choices (prefer ‘sad’, or prefer ‘happy’), both therefore having a probability of 50 %.
Probability
VP1: + or -
VP2: + or -
Possible results ($2^2 = 4$) [+ here means: prefer sad version]:
– VP1 +, VP2 +
– VP1 +, VP2 -
– VP1 -, VP2 +
– VP1 -, VP2 -
Out of these 4 possibilities:
– 2 × + occurs once: p = 1/4 = 0.25
– 1 × + occurs twice: p = 2/4 = 0.50
– 0 × + occurs once: p = 1/4 = 0.25
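This counting can be checked by brute force. The sketch below lists every possible combination of answers and tallies the number of + in each; the function name is invented for illustration.

```python
from collections import Counter
from itertools import product

def count_outcomes(n_subjects):
    """Map 'number of + answers' to how many of the 2**n equally likely outcomes show it."""
    outcomes = product("+-", repeat=n_subjects)  # e.g. ('+', '-') for two subjects
    return Counter(outcome.count("+") for outcome in outcomes)

counts = count_outcomes(2)
total = sum(counts.values())  # 2**2 = 4 possible outcomes
for n_plus in sorted(counts, reverse=True):
    print(f"{n_plus} x +: {counts[n_plus]}/{total} = {counts[n_plus] / total:.2f}")
```

Calling `count_outcomes(4)` or `count_outcomes(8)` gives the counts for larger groups of informants in the same way.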
Tree structure for 4 Ss [tree diagram: each of S1 to S4 branches into + and -] Now there are 16 possibilities in all: $2^4$
16 possibilities are distributed as follows:
No. of +   Fraction   Probability
4          1/16       0.063
3          4/16       0.250
2          6/16       0.375
1          4/16       0.250
0          1/16       0.063
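For 8 Ss
No. of +   Fraction   Probability
8          1/256      0.004
7          8/256      0.031
6          28/256     0.109
5          56/256     0.219
4          70/256     0.273
3          56/256     0.219
2          28/256     0.109
1          8/256      0.031
0          1/256      0.004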
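The fractions for any number of informants follow from the binomial coefficient, so nobody has to draw the tree. A short sketch, assuming Python 3.8+ for `math.comb`; the function name is made up for this example.

```python
from math import comb

def binomial_table(n):
    """Rows of (number of +, ways, total outcomes, probability) for n informants."""
    total = 2 ** n
    return [(k, comb(n, k), total, comb(n, k) / total) for k in range(n, -1, -1)]

print("No. of +   Fraction   Probability")
for k, ways, total, p in binomial_table(8):
    print(f"{k:<10} {ways}/{total:<8} {p:.3f}")
```

The probabilities in each table add up to 1, since one of the outcomes must occur.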
p 7 out of 8 informants said they preferred the ‘sad’ version of the story. The probability of this we now know: p = 0.031. p = probability, varying between 0 (NEVER happens) and 1 (ALWAYS happens). Moving the decimal point 2 places to the right gives a percentage: p = 0.031 means 3.1 % probability = the probability that the results came about by random errors! I.e. the error probability, which must be as low as possible!
Because random errors will NEVER go away, p means the probability that we falsely conclude that the IV had an effect on the DV. How certain are we therefore? 100 - 3.1 = 96.9 %. If only random errors were at work and we repeated this experiment 100 times, we would on average find such a result only about 3 times. In such a situation it is allowed to say that the ending of a story has an effect on readers’ preference.
Graphically! [Figure: distribution of the number of + when only random errors are at stake.] Both 0 + and 8 + are rare (p = 0.004 = 0.4 %), and so are 1 + and 7 + (p = 0.031). How low must p be? No ultimate answer, because random errors remain!
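What "only random errors" means can also be made tangible by simulation. The sketch below lets 8 imaginary informants choose purely by coin flip, repeats that many times, and reports how often each number of + turns up. The function name, the number of repetitions and the seed are arbitrary choices for illustration.

```python
import random

def simulate_chance_only(n_subjects=8, n_experiments=20_000, seed=42):
    """Relative frequency of each number of + when every answer is a coin flip."""
    rng = random.Random(seed)
    counts = [0] * (n_subjects + 1)
    for _ in range(n_experiments):
        n_plus = sum(rng.random() < 0.5 for _ in range(n_subjects))
        counts[n_plus] += 1
    return [c / n_experiments for c in counts]

freqs = simulate_chance_only()
for n_plus, freq in enumerate(freqs):
    print(f"{n_plus} x +: {freq:.3f} " + "#" * round(freq * 100))
```

The simulated frequencies should come close to the exact fractions (for instance about 0.03 for 7 +), with small deviations that are themselves random errors.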
H0 vs. Ha We are testing a hypothesis, usually a hypothesis of difference (between groups) = Ha, the alternative hypothesis. Its logical opposite is the hypothesis of no difference = H0 (the ‘null hypothesis’). We try to REJECT H0. If successful, we accept Ha. But watch out: in science, we have to be cautious!
Alpha (α) Choose a ‘significance level’ (= α). α high: high probability of making a Type 1 error: concluding that the IV had an effect on the DV when in reality it did not. α small: high chance of making a Type 2 error: accepting H0 although it is wrong.
Error types and alpha α high (+): Type 1 error: we falsely accept the Ha (we think the IV had an influence, but it does not). α low (-): Type 2 error: we falsely accept the H0 (we think the IV had no influence, but it does).
Decision matrix
                      H0 is true         H0 is false
Fail to reject H0     correct decision   Type 2 error
Reject H0             Type 1 error       correct decision
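The trade-off between the two error types can be put into numbers for the story example. The sketch below rests on two assumptions that are not in the slides: the decision rule "reject H0 when at least c of the 8 informants prefer the sad version", and a hypothetical true preference rate of 80 % for the case in which H0 is false.

```python
from math import comb

def prob_at_least(c, n, p):
    """Probability of c or more + answers out of n when each + has probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))

def error_rates(c, n=8, true_p=0.8):
    """Type 1 and Type 2 error probabilities for the rule 'reject H0 if at least c are +'."""
    type1 = prob_at_least(c, n, 0.5)         # H0 is true (p = 0.5), yet we reject it
    type2 = 1 - prob_at_least(c, n, true_p)  # H0 is false (assumed p = 0.8), yet we keep it
    return type1, type2

for c in (6, 7, 8):
    type1, type2 = error_rates(c)
    print(f"reject H0 at >= {c} x +: Type 1 = {type1:.3f}, Type 2 = {type2:.3f}")
```

A stricter rule (higher c) lowers the Type 1 risk and raises the Type 2 risk, and the other way round.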
However, since we have no means of knowing whether the H0 is really true or false, all we can do is reduce the uncertainty of our decision, and thereby reduce the chance of making a Type 1 error. There are no certainties, only probabilities in statistics.
Compare to a case in court When very weak evidence for a crime is accepted by a court of law, then a lot of (innocent) people are going to be convicted. If a court accepts only the strongest form of evidence, then a lot of criminals will go free without a conviction. So … some kind of balance is needed. And this balance can best be provided if you know a bit about statistics.
A memory-enhancing drug We select 100 students, 50 of whom get the drug, the other 50 a placebo (without knowing themselves who got what!). We then give them an exam that depends heavily on memory. Results are scored by examiners who do not know which student got which pill. This is called a double-blind design: neither observer nor observed knows who is who.
How big must the difference be? This is a somewhat misleading question. It is like asking “How tall must you be to become a good basketball player?” Well, ideally as tall as possible. But there is no clear-cut height below which you cannot dream of it. So it is a sliding scale. So it is with p-values: the lower the better! But statisticians have established a conventional level: p < .05
But does this mean that p = .049 is significant, while p = .051 is not? That is to fundamentally misunderstand the nature of p-values. The criterion of .05 is merely a convention. The lower p is, the more confident we are that we may reject the H0. If p is marginally above the .05 criterion, it does not mean that the Ha has no plausibility. It is exactly this sliding scale that makes significance values so informative. BTW: .05 means one in twenty!
Within this range (= probability): 95 % of all observations. Outside this range remain 5 % of all observations. This is the level of random errors we are ready to accept. Here (outside the range) we say that the IV had an effect on the DV. Therefore we reject the H0.
Since we know about the normal distribution We know that 68 % of all values lie within 1 SD of the mean, and 95 % within 2 SD. Mean = 4.00, SD = 1.44. Mean ± 2 SD = 4.00 ± 2.88, i.e. between 1.12 and 6.88. Our observation (7 × +) lies outside this range. Hence: significant!
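The same range can be derived from the chance model of the story example. The sketch assumes the binomial formulas mean = n·p and SD = √(n·p·(1−p)) with n = 8 and p = 0.5; this theoretical SD (about 1.41) is close to, but not necessarily identical with, the value on the slide, so the bounds may differ in the second decimal.

```python
from math import sqrt

def two_sd_range(n, p=0.5):
    """Mean ± 2 SD for the number of + among n informants who choose by chance."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return mean - 2 * sd, mean + 2 * sd

low, high = two_sd_range(8)
observed = 7  # 7 of the 8 informants preferred the sad version
outside = not (low <= observed <= high)
print(f"range: {low:.2f} to {high:.2f}; observation outside the range: {outside}")
```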
Region of rejection p < 0.05 = ‘significant’ / p < 0.01 = ‘highly significant’ / p < 0.001 = ‘very highly significant’. The significance level separates: 1) the area where only random errors had an effect, 2) the area where the IV had an effect on the DV (the critical region), where we reject the H0.
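As a small illustration, the conventional labels can be written as a lookup. The function name is invented, and the 0.001 threshold is the usual convention, assumed here. The labels are conventions, not sharp boundaries: p = .049 and p = .051 get different labels although they are practically the same result.

```python
def significance_label(p):
    """Conventional verbal label for a p-value (thresholds are conventions only)."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    if p < 0.001:
        return "very highly significant"
    if p < 0.01:
        return "highly significant"
    if p < 0.05:
        return "significant"
    return "not significant"

for p in (0.0004, 0.004, 0.031, 0.051):
    print(p, "->", significance_label(p))
```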
NB A significant difference does NOT imply a value judgment. It merely tells us how likely it is that the results are due to chance. Whether this leads to any change (for instance in instruction methods, a new medicine, etc.) has to be decided on other than statistical grounds. E.g. how much does it cost (in time, money, learning curve, …), what are the consequences of not changing anything, etc.
Comparison of means In general: between E- and C-groups, or between 2 E-groups. To compare both: certain statistical techniques (= tests --> test statistics = inferential statistics). Matrix with measurement level + normal distribution? + type of sample (independent / dependent). Better still: decision chart (see Scientific Methods for the Humanities, p. 231-)
Levels of measurement
– Nominal: putting a variable into a category, e.g. gender, place of living, political preference, etc.
– Ordinal: these are ordered categories, e.g. education level, EFL proficiency level, preference of musical composer, price of a car, …. [lacking is the distance between ranks: is the 2nd composer only half as good as the first one? And the 3rd one?]
– Interval: scaled order, with equal distances.
– Ratio: likewise, but now with a zero-point. E.g. age, dividing 100 points among 4 authors, …
Three possibilities
1. The means of the samples differ
2. The variance of the samples differs
3. Both the mean and the variance differ
In each case, we apply statistical tests to estimate the significance of the differences. When p is below the conventional level of 5 % (error probability), we accept that the sample differences may be generalized to the population.
Variables Attributes, characteristics, qualities, etc. E.g. gender, age, nationality, but above all: the ‘treatment’ (what you think exerts an influence) = independent variable (IV). The ‘reactions to the treatment’ (what you expect the influence will be) = dependent variable (DV). The IV causes the DV, the DV is caused by the IV.
Kinds of tests T-test: 1 IV (2 groups), 1 DV. ANOVA: 1 or more IVs, 1 DV. MANOVA: 1 or more IVs, > 1 DV (= GLM). But these are parametric tests: they presuppose that your data are normally distributed and measured at least at interval level. How to know whether my data follow a normal distribution?
The Kolmogorov-Smirnov test This test takes an ideal distribution, projects your distribution on it, and then gauges whether the two differ significantly from each other. ‘Significance’ here in the statistical sense! Meaning: the error probability < .05. Or, in the table of the results: p < .05. In that case, your data are NOT normally distributed!
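The "projection" amounts to measuring the largest gap between the cumulative distribution of your data and that of the ideal normal curve. The sketch below computes only that gap (the statistic usually called D), using a normal curve fitted to the sample's own mean and SD. That fitting step, the function name and the two small data sets are assumptions made for illustration; statistical packages go on to turn D into the p-value described above.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic(data):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF steps from (i - 1)/n up to i/n at x: check both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

symmetric = [2, 3, 3, 4, 4, 4, 5, 5, 6]  # hypothetical, roughly bell-shaped scores
skewed = [1, 1, 1, 1, 1, 2, 2, 3, 10]    # hypothetical, strongly skewed scores
print(f"D symmetric: {ks_statistic(symmetric):.3f}")
print(f"D skewed:    {ks_statistic(skewed):.3f}")
```

The skewed sample lies further from the normal curve, so its D is larger.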
What to do in that case? The parametric tests assume a number of things (i.e. interval measurement, normal distribution, etc.). When these assumptions are not fulfilled: use non-parametric tests! These exist for 2 independent / dependent samples and for k [> 2] independent / dependent samples. Independent samples: no overlap between the samples. Dependent samples: the same people.
One or two-tailed? Ha says only THAT there will be a difference between 2 groups (without direction of the difference) = two-tailed. WITH a direction (e.g.: E > C) = one-tailed. In case of one-tailed: divide p-values by 2. But note that this is a controversial issue among statisticians. E.g.: if you know the direction of the hypothesis, then why do you need to test for significance?
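For the story example (7 of 8 informants prefer the sad version) both variants can be computed exactly. The sketch assumes that "at least as extreme" means 7 or more + for the one-tailed case and the mirror image (1 or fewer +) added for the two-tailed case. That is a common convention but an assumption here, and it gives slightly higher values than the probability of exactly 7 + alone. The function name is invented.

```python
from math import comb

def binomial_p_values(k, n):
    """One- and two-tailed exact p-values for k + out of n, assuming k >= n/2
    and equal chances for + and - under H0."""
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    two_tailed = min(1.0, 2 * upper_tail)  # the chance distribution is symmetric
    return upper_tail, two_tailed

one_tailed, two_tailed = binomial_p_values(7, 8)
print(f"one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

The one-tailed value is exactly half the two-tailed one, which is what "divide p-values by 2" expresses. With so few informants, the choice between the two can decide on which side of .05 a result falls.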
F-ratio $F = \dfrac{\text{between-groups mean square}}{\text{within-groups mean square}}$ When H0 holds, F is about 1. F well above 1 points to an effect. Very high F-values mean very low p-values, i.e. very unlikely the result of chance! Hence accept Ha.
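How such a ratio comes about can be sketched for the simplest case, one IV with several independent groups. The function name and the three small groups of scores are invented for illustration; turning F into a p-value additionally needs the F distribution, which statistical packages supply.

```python
from statistics import mean

def f_ratio(groups):
    """Between-groups mean square divided by within-groups mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # How far the group means lie from the grand mean (the effect of the IV)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # How far individual scores lie from their own group mean (random errors)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical scores for three instruction methods
scores = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
print(f"F = {f_ratio(scores):.2f}")
```

In this made-up data set the group means differ a lot while the scores within each group stay close together, so F comes out far above 1.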